Noor-Vajeh: A Benchmark Dataset for Keyword Extraction from Persian Papers

Taheri, Mohammad Amin; Shenassa, Mohammad Ebrahim; Minaei-Bidgoli, Behrouz; Hossayni, Sayyed Ali

doi:10.61186/jsdp.21.4.113

Volume 21, Issue 4 (3-2025) JSDP 2025, 21(4): 113-123

Taheri M A, Shenassa M E, Minaei-Bidgoli B, Hossayni S A. Noor-Vajeh: A Benchmark Dataset for Keyword Extraction from Persian Papers. JSDP 2025; 21 (4) : 8
URL: http://jsdp.rcisp.ac.ir/article-1-1340-en.html

Mohammad Amin Taheri*, Mohammad Ebrahim Shenassa, Behrouz Minaei-Bidgoli, Sayyed Ali Hossayni

Master Student, Faculty of Computer Engineering, Technology and Science of University, Tehran, Iran

Abstract:

The overall intent and focal points of a text can be expressed in several ways, and keywords are arguably the most suitable. Keywords are the most prominent phrases in a document, the ones that convey its main message. By pulling relevant words and phrases out of a text, keyword extraction helps uncover meaningful patterns and gives an overview of the content. It can also highlight the most significant concepts and direct the attention of a machine learning algorithm to them.
Keyword extraction is an essential subtask of natural language processing. Because it reduces the complexity of a text and makes it easier to process, it can serve as the basis for many other tasks, such as text classification, clustering, and summarization. Extracted keywords help a machine better capture the meaning and context of a text, which in turn supports better analysis, pattern recognition, and more accurate decisions. Extraction can also shorten processing time by discarding unnecessary words and concentrating on the most important ones. Several datasets have been proposed for evaluating keyword extraction methods in Persian, but most of them contain only the authors' keywords and do not cover all potential keywords. Evaluating on such datasets therefore leads to incorrect judgments about the accuracy of supervised and unsupervised methods, since a valid keyword that the authors did not list is scored as an error.
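The effect of an incomplete gold set can be illustrated with a small sketch. The function below computes exact-match precision, recall, and F1; the keyword lists are hypothetical English stand-ins chosen for readability and are not taken from Noor-Vajeh or any other dataset, and exact matching is only one possible scoring scheme.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 between two keyword collections."""
    predicted_set = {k.strip().lower() for k in predicted}
    gold_set = {k.strip().lower() for k in gold}
    hits = len(predicted_set & gold_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical output of some extraction method (illustrative only).
predicted = ["keyword extraction", "persian dataset",
             "graph-based methods", "textrank"]

# Hypothetical gold sets: authors' keywords only vs. authors' plus expert additions.
author_only = ["keyword extraction", "information retrieval"]
expanded = author_only + ["persian dataset", "graph-based methods",
                          "unsupervised learning"]

p_author, r_author, f_author = precision_recall_f1(predicted, author_only)
p_full, r_full, f_full = precision_recall_f1(predicted, expanded)

print(f"authors only: P={p_author:.2f} R={r_author:.2f} F1={f_author:.2f}")
print(f"expanded:     P={p_full:.2f} R={r_full:.2f} F1={f_full:.2f}")
```

In this toy case the same predictions score a precision of 0.25 against the authors-only set and 0.75 against the expanded set, because two correct keywords are no longer counted as errors.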
In this paper, we introduce Noor-Vajeh, a Persian keyword extraction dataset of about 1400 scientific papers. To complete the keyword set for each article, we asked experts to extract potential keywords in addition to the authors' keywords. The resulting dataset is a valuable resource for ongoing research on Persian keyword extraction. To assess its suitability as a benchmark, we tested several unsupervised keyword extraction methods. We chose unsupervised methods because, compared with supervised ones, they take less time to run and require little or no manual tuning. They can also extract keywords with a high degree of accuracy and generalize well across articles.
Among the many categories of unsupervised methods, graph-based methods have been applied regularly in different projects, so we describe and use some of the best known: TextRank, SingleRank, and PositionRank. These algorithms identify important words and phrases in a text as well as the relationships between them, which gives insight into the overall structure and meaning of the text. This makes them useful in a variety of tasks, such as machine translation, text summarization, and sentiment analysis. Graph-based methods are also highly versatile and can be adapted to different datasets and tasks, which makes them well suited to benchmarking. The results obtained with these methods confirm the comparisons between them reported in other papers.
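As a rough illustration of how a graph-based method ranks words, the following is a minimal sketch of unweighted TextRank: tokens that co-occur within a small window are linked, and a PageRank-style iteration scores each node. It is not the implementation evaluated in the paper. The function name, parameters, and token list are assumptions made for this example, and the input is assumed to be already tokenized and filtered to candidate words.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank tokens by a PageRank-style score over a co-occurrence graph.

    Assumes `tokens` is already tokenized and filtered to candidate words.
    """
    # Link every pair of distinct tokens that fall within the same window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Iterate S(v) = (1 - d) + d * sum(S(u) / degree(u)) over neighbors u of v.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Hypothetical pre-filtered token sequence (illustrative only).
tokens = ["keyword", "extraction", "persian", "keyword",
          "dataset", "graph", "keyword", "extraction"]
top = textrank_keywords(tokens, window=2, top_k=3)
print(top)
```

SingleRank differs mainly in weighting each edge by its co-occurrence count, and PositionRank biases the random walk toward words that appear early in the document; neither variation is shown here.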

Article number: 8

Keywords: Keyword Extraction, Persian Dataset, Unsupervised Learning, Graph-Based Methods, Information Retrieval

Type of Study: Research | Subject: Paper
Received: 2022/09/5 | Accepted: 2024/12/4 | Published: 2025/03/18 | ePublished: 2025/03/18

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
