Supervised approach for keyword extraction from Persian documents using lexical chains

Sharifi, Atieh; Mahdavi, M.Amin

doi:10.29252/jsdp.15.4.95

Signal and Data Processing Journal. A scientific journal officially licensed by the Commission for Scientific Publications of the (MSRT). Publisher: Research Center for Development of Technologies


Volume 15, Issue 4 (3-2019) JSDP 2019, 15(4): 95-110

Sharifi A, Mahdavi M. Supervised approach for keyword extraction from Persian documents using lexical chains. JSDP 2019; 15 (4) :95-110
URL: http://jsdp.rcisp.ac.ir/article-1-733-en.html


Atieh Sharifi*, M. Amin Mahdavi

Abstract:

Keywords are the focal points of a text and represent the principal concepts of the document. Assigning keywords manually is time-consuming and requires specialized knowledge of the subject. To index the vast number of electronic documents, the keyword extraction task needs to be automated. Because the keywords of a document form a coherent structure, we focus on the relations between words. Most previous methods for Persian rely on statistical relations between words and do not consider sense relations. Because word meanings are ambiguous, however, statistical methods alone cannot determine how words are related. We propose a supervised method that extracts new features for each word from its lexical chain. Combining these features with statistical features can make a supervised system more effective. We map the relations among word senses using lexical chains, so “FarsNet” plays a key role in constructing the chains in the proposed model. The lexical chains are built with a modified version of Galley and McKeown's algorithm. We used the Java version of the hazm library to identify candidate words in the text through POS tagging and noun phrase chunking. Ten features are computed for each candidate word: four relate to the frequency and position of the word in the text, and the remaining six relate to the word's lexical chain. After the classifier extracts the keywords, a post-processing step identifies two-word key phrases that were not obtained in the previous step. The dataset was drawn from Persian scientific papers, of which we used only the title and abstract.
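As a rough illustration of how lexical chain features can sit beside frequency and position features, the following Python sketch groups words into chains and derives a few features per word. It rests on several assumptions: the `RELATED` set is a hypothetical stand-in for a FarsNet lookup, English tokens replace Persian ones, chains are formed as simple connected components rather than by Galley and McKeown's disambiguation procedure, and the four feature names are illustrative rather than the ten features used in the paper.

```python
from collections import defaultdict

# Hypothetical stand-in for FarsNet: word pairs assumed to be semantically related.
RELATED = {("extraction", "retrieval"), ("keyword", "term"), ("term", "phrase")}

def build_chains(words, related):
    """Group words into chains as connected components of the relation graph."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the component root, shortening the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in related:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    chains = defaultdict(set)
    for w in words:
        chains[find(w)].add(w)
    return list(chains.values())

def word_features(tokens, related):
    """Compute illustrative statistical and chain-based features for each word."""
    counts = defaultdict(int)
    first_pos = {}
    for i, t in enumerate(tokens):
        counts[t] += 1
        first_pos.setdefault(t, i)

    chains = build_chains(set(tokens), related)
    chain_of = {w: c for c in chains for w in c}
    n = len(tokens)

    feats = {}
    for w in counts:
        chain = chain_of[w]
        feats[w] = {
            "tf": counts[w] / n,                # relative frequency
            "first_pos": first_pos[w] / n,      # relative first occurrence
            "chain_size": len(chain),           # distinct words in the chain
            "chain_weight": sum(counts[m] for m in chain) / n,  # chain coverage
        }
    return feats

tokens = ["keyword", "extraction", "term", "retrieval", "keyword", "phrase"]
features = word_features(tokens, RELATED)
print(features["keyword"])
```

In this toy input, "keyword", "term", and "phrase" fall into one chain, so each of them receives the same chain-level values even though their individual frequencies differ.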
The results show that using semantic relations alongside statistical features improves the overall performance of keyword extraction for papers. The Naive Bayes classifier gave the best results among the classifiers investigated, and eliminating some of the lexical chain features improved its performance.
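To make the classification step concrete, here is a minimal Gaussian Naive Bayes sketch in Python that labels a candidate word as keyword (1) or non-keyword (0) from a feature vector. The abstract does not say which Naive Bayes variant or toolkit was used, so the Gaussian likelihood, the variance smoothing constant, the three-feature layout, and the toy training data are all assumptions made for illustration.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            # Small constant keeps the variance strictly positive (assumed smoothing).
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            stats.append((mean, var))
        model[label] = (len(rows) / len(y), stats)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Toy feature vectors: [term frequency, relative first position, chain size]
X = [[0.30, 0.0, 3], [0.25, 0.1, 4], [0.05, 0.8, 1], [0.04, 0.9, 1]]
y = [1, 1, 0, 0]  # 1 = keyword, 0 = not a keyword

model = fit_gaussian_nb(X, y)
print(predict(model, [0.28, 0.05, 3]))
```

Dropping a column from every vector in `X` is one simple way to mimic the feature elimination mentioned in the results, since each feature contributes an independent term to the score.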

Keywords: Keyword Extraction, Persian Document, Supervised Learning, Lexical Chain, FarsNet


Type of Study: Applicable | Subject: Paper
Received: 2017/12/3 | Accepted: 2018/05/16 | Published: 2019/03/8 | ePublished: 2019/03/8

References

[1] J. Wang, J. Liu and C. Wang, "Keyword Extraction Based on PageRank," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, 2007.

[2] X. Li and F. Song, "Keyphrase Extraction and Grouping Based on Association Rules," in FLAIRS Conference, Hollywood, Florida, 2015.

[3] B. Lott, "Survey of keyword extraction techniques," UNM Education, 2012.

[4] R. Nelken and S. M. Shieber, "Lexical chaining and word-sense-disambiguation," School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, Technical Report TR-06-07, 2007.

[5] G. Ercan, "Automated text summarization and keyphrase extraction," M.S. thesis, Bilkent University, Ankara, Turkey, 2006.

[6] M. Shamsfard, "Towards Semi Automatic Construction of a Lexical Ontology for Persian," in Sixth International Conference on Language Resources and Evaluation, Morocco, 2008.

[7] M. Galley and K. McKeown, "Improving word sense disambiguation in lexical chaining," IJCAI, vol. 3, pp. 1486-1488, 2003.

[8] K. Hasan and V. Ng, "Automatic Keyphrase Extraction: A Survey of the State of the Art," in ACL, 2014. [DOI:10.3115/v1/P14-1119]

[9] C. Wu, M. Marchese and J. Jiang, "Machine Learning-Based Keywords Extraction for Scientific Literature," Journal of Universal Computer Science, vol. 13, no. 10, pp. 1471-1483, 2007.

[10] S. Beliga, "Keyword extraction: a review of methods and approaches," University of Rijeka, Department of Informatics, Rijeka, 2014.

[11] S. Beliga, A. Mestrovic and S. Martincic, "An overview of graph-based keyword extraction methods and approaches," Journal of Information and Organizational Sciences, vol. 39, no. 1, pp. 1-20, 2015.

[12] T. Pay and S. Lucci, "Automatic Keyword Extraction: An Ensemble Method," in 2017 IEEE International Conference on Big Data, Boston, 2017. [DOI:10.1109/BigData.2017.8258552]

[13] M. Johansson and P. Lindstrom, "Keyword Extraction using Machine Learning," M.S. thesis, Gothenburg University, Gothenburg, Sweden, 2010.

[14] A. Hulth, "Combining machine learning and natural language processing for automatic keyword extraction," Ph.D. dissertation, Stockholm University, Stockholm, Sweden, 2004.

[15] Y. HaCohen-Kerner, Z. Gross and A. Masa, "Automatic extraction and learning of keyphrases from scientific articles," in International Conference on Intelligent Text Processing and Computational Linguistics, Springer Berlin Heidelberg, 2005. [DOI:10.1007/978-3-540-30586-6_74]

[16] O. Medelyan and I. H. Witten, "Thesaurus based automatic keyphrase indexing," in Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, 2006. [DOI:10.1145/1141753.1141819]

[17] C. Zhang, H. Wang, Y. Liu, D. Wu, Y. Liao and B. Wang, "Automatic Keyword Extraction from Documents Using Conditional Random Fields," Computational Information Systems, vol. 4, no. 3, pp. 1169-1180, 2008.

[18] M. Krapivin, A. Autayeu, M. Ma, E. Blanzieri and N. Segata, "Keyphrases extraction from scientific documents: improving machine learning approaches with natural language processing," in International Conference on Asian Digital Libraries, Springer Berlin Heidelberg, 2010. [DOI:10.1007/978-3-642-13654-2_12]

[19] C. Caragea and F. Bulgarov, "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach," in Empirical Methods in Natural Language Processing (EMNLP), Doha, 2014. [DOI:10.3115/v1/D14-1150]

[20] O. Alqaryouti, T. A. Farouk, A. R. Nabhan and K. Shaalan, "Graph-Based Keyword Extraction," in Intelligent Natural Language Processing: Trends and Applications, Springer, Cham, 2018, pp. 159-172. [DOI:10.1007/978-3-319-67056-0_9]

[21] Z. Liu and P. Liu, "Clustering to Find Exemplar Terms for Keyphrase Extraction," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009. [DOI:10.3115/1699510.1699544]

[22] S. Arabi Narei, M. Vahidi Asl and B. Minaei Bidgoli, "Keyword extraction for Persian text classification," in First Iran Data Mining Conference, Amirkabir University, 2007.

[23] M. Mohammadi Janghara and M. Analouei, "Keyword extraction from Persian documents," in 13th Annual Conference of Computer Society of Iran, Kish Island - Computer Society, Sharif University of Technology, 2008.

[24] A. Ahmadi and T. Hoseinikhah, "Keyword Extraction from a text using Neural Network," in Tenth International Industrial Engineering Conference, Amirkabir University, 2014.

[25] F. Rad, H. Parvin, A. Dehbashi and B. Minaei, "A New Method for Automatic Indexing and Extracting Keywords for Information Retrieval and Clustering of Texts," Journal of Signal Processing and Data, vol. 13, no. 1, pp. 87-100, 2017.

[26] H. G. Silber and K. F. McCoy, "Efficiently computed lexical chains as an intermediate representation for automatic text summarization," Computational Linguistics, vol. 28, no. 4, pp. 487-496, 2002. [DOI:10.1162/089120102762671954]

[27] M. Enss, "An investigation of word sense disambiguation for improving lexical chaining," M.S. thesis, Waterloo University, Waterloo, Canada, 2006.

[28] X. Li, "Keyphrase Extraction and Grouping Based on Association Rules," M.S. thesis, Guelph University, Guelph, Canada, 2014.

[29] B. Lott, "Survey of keyword extraction techniques," December, 2012.

[30] S. Beliga, "Keyword extraction: a review of methods and approaches," unpublished, 2014.


Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.