Improving Precision of Keywords Extracted From Persian Text Using Word2Vec Algorithm

Hasni Ahangar, Mohammad Reza; Amiri jezeh, Ali

doi:10.52547/jsdp.18.1.60

Volume 18, Issue 1 (5-2021) JSDP 2021, 18(1): 51-60



Hasni Ahangar M R, Amiri jezeh A. Improving Precision of Keywords Extracted From Persian Text Using Word2Vec Algorithm. JSDP 2021; 18(1): 51-60
URL: http://jsdp.rcisp.ac.ir/article-1-858-en.html


Mohammad Reza Hasni Ahangar*, Ali Amiri jezeh

Imam Hossein University

Abstract:

Keywords convey the main concepts of a text, and a suitable model can extract them without human intervention. They are the important vocabulary items that describe a text, and they play a central role in understanding its content quickly and accurately. The purpose of keyword extraction is to identify the subject and main content of a text in the shortest possible time. Keyword extraction is important in text summarization, document labeling, information retrieval, and topic extraction. For example, condensing a large text into a shorter one is difficult, but the keywords alone can tell a reader which topics the text covers. Identifying keywords with conventional methods is time-consuming and costly. Keyword extraction methods fall into two classes: supervised and unsupervised. In general, the process works as follows: the text is first split into smaller units (words), redundant words are removed, the remaining words are weighted, and the keywords are selected from the weighted words. The method proposed in this paper is supervised. We first compute a word correlation matrix for each document using a feed-forward neural network and the Word2Vec algorithm. Then, using the correlation matrix and a limited initial list of keywords, we extract the most similar words as a list of nearest neighbors. Next, we sort this list in descending order and select different percentages of words from the top of the list. For each percentage, we repeat the process of training the neural network, building the correlation matrix, and extracting the nearest-neighbor list 10 times. Finally, we calculate the average precision, recall, and F-measure.
We continue this process until the evaluation yields the best results. The results show that acceptable results are obtained when up to 40% of the words are selected from the top of the nearest-neighbor list. The algorithm was tested on a corpus of 800 news items whose keywords had been extracted manually, and the experimental results show that the accuracy of the proposed method is 78%.
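
The neighbor-expansion and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy 2-D vectors stand in for trained Word2Vec embeddings, the function names `expand_keywords` and `precision_recall_f1` are hypothetical, and ranking each word by its highest cosine similarity to any seed keyword is an assumption about how the nearest-neighbor list might be built.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(vectors, seeds, top_fraction):
    """Rank non-seed words by their best similarity to any seed keyword,
    sort in descending order, and keep the top fraction of the list."""
    scores = {}
    for word, vec in vectors.items():
        if word in seeds:
            continue
        scores[word] = max(
            (cosine(vec, vectors[s]) for s in seeds if s in vectors),
            default=0.0,
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(top_fraction * len(ranked)))
    return ranked[:keep]

def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-measure of predicted keywords against a gold set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy embeddings; real vectors would come from a trained model.
vectors = {
    "election": (1.0, 0.0),
    "vote": (0.9, 0.1),
    "ballot": (0.8, 0.3),
    "parliament": (0.7, 0.6),
    "weather": (0.0, 1.0),
    "rain": (0.1, 0.9),
}
seeds = {"election"}  # limited initial keyword list
candidates = expand_keywords(vectors, seeds, top_fraction=0.4)
precision, recall, f1 = precision_recall_f1(
    candidates, gold={"vote", "ballot", "parliament"}
)
print(candidates, precision, recall, f1)
```

In this toy setting, varying `top_fraction` shows the trade-off the abstract evaluates: a larger fraction tends to raise recall and lower precision. Averaging over repeated training runs, as the abstract describes, would wrap these calls in a loop over freshly trained embeddings.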

Keywords: keywords, word2vec algorithm, neural network, feature weighting


Type of Study: Applicable | Subject: Paper
Received: 2018/04/22 | Accepted: 2021/02/27 | Published: 2021/05/22 | ePublished: 2021/05/22



Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
