An Approach for Extraction of Keywords and Weighting Words for Improvement Farsi Documents Classification

Rezaie, Vahideh; Mohammadpour, Mahid; Parvin, Hamid; Nejatian, Samad

doi:10.29252/jsdp.14.4.55

Volume 14, Issue 4 (3-2018) JSDP 2018, 14(4): 55-78

Rezaie V, Mohammadpour M, Parvin H, Nejatian S. An Approach for Extraction of Keywords and Weighting Words for Improvement Farsi Documents Classification. JSDP 2018; 14(4): 55-78
URL: http://jsdp.rcisp.ac.ir/article-1-449-en.html


Vahideh Rezaie, Mahid Mohammadpour, Hamid Parvin, Samad Nejatian*

Abstract:

The continuing growth of information and the very large number of unstructured documents make keywords essential for information retrieval. Because manual keyword extraction faces various challenges, automated extraction is necessary. This research uses a thesaurus (a structured word-net) to extract keywords automatically. The authors claim that a thesaurus yields more meaningful keywords than extraction without one, and that keywords extracted this way improve document classification. To increase the comprehensiveness of search, the method proceeds in steps. First, stop words are removed and the remaining words are stemmed. Next, the thesaurus is used to find words that have equivalent, hierarchical, and dependent relations to them. Then, to determine the relative importance of words, each word is assigned a numerical weight that represents its effect on the subject matter relative to the other words in the text. Based on these steps and the thesaurus, an accurate text classification is performed with the KNN (k-nearest neighbors) algorithm, which is widely used in text classification because of its simplicity and effectiveness. The core of KNN is to compare a test text with the training texts to determine their similarity. The empirical results show that users find the quality and accuracy of the extracted keywords satisfactory. They also confirm that document classification is enhanced.
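The steps described in the abstract can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the suffix-stripping `stem` function, the toy `THESAURUS`, the `RELATED_WEIGHT` discount, the cosine similarity measure, and the English example texts are all hypothetical stand-ins for the Farsi resources and weighting scheme used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy resources; a real system would load a Farsi stop-word
# list, a Farsi stemmer, and a full thesaurus.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}
THESAURUS = {  # term -> related terms (equivalent, hierarchical, dependent)
    "car": ["vehicle", "automobile"],
    "football": ["sport", "soccer"],
}
RELATED_WEIGHT = 0.5  # assumed discount for terms added via the thesaurus


def stem(word):
    """Naive suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def weighted_terms(text):
    """Remove stop words, stem, expand with the thesaurus, and weight terms."""
    tokens = [stem(w) for w in text.lower().split() if w not in STOP_WORDS]
    weights = Counter()
    for tok in tokens:
        weights[tok] += 1.0
        for related in THESAURUS.get(tok, []):
            weights[related] += RELATED_WEIGHT
    total = sum(weights.values())
    # Normalize so each weight is relative to the other words in the text.
    return {t: w / total for t, w in weights.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, k=3):
    """Label a text by majority vote among its k most similar training texts."""
    query = weighted_terms(text)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, weighted_terms(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled training set (hypothetical).
TRAINING = [
    ("cars and engines", "auto"),
    ("automobile repair", "auto"),
    ("football match", "sport"),
    ("soccer league", "sport"),
]

print(knn_classify("fast car", TRAINING))
print(knn_classify("football cup", TRAINING))
```

In this sketch the thesaurus expansion is what lets "fast car" match "automobile repair" even though the two texts share no surface word, which is the effect the abstract attributes to thesaurus-based keyword extraction.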

Keywords: thesaurus, information retrieval, extraction of keywords, weight


Type of Study: Research | Subject: Paper
Received: 2015/10/30 | Accepted: 2017/10/25 | Published: 2018/03/13 | ePublished: 2018/03/13



Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
