Volume 14, Issue 2 (9-2017)                   JSDP 2017, 14(2): 59-74





Sajadi S M B, Rashidi H, Minaei Bidgoli B. A New Approach for Extracting Named Entity in Classical Arabic. JSDP 2017; 14(2): 59-74
URL: http://jsdp.rcisp.ac.ir/article-1-295-en.html
Allameh Tabataba'i University
Abstract:

In Natural Language Processing (NLP), developing resources and tools extends the reach and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted NLP researchers because it significantly improves other NLP tasks such as machine translation, information retrieval, question answering, and query result clustering. While most of this research targets Modern Standard Arabic (MSA), this paper focuses on Classical Arabic (CA) literature. We present NoorCorp, a corpus of about 130k labeled words, manually annotated by human experts, for research purposes. The corpus is based on a historic Islamic book written about 1200 years ago and comprises 1843 sentences and 127550 words. We also collected about 18k proper names from old Hadith books into a gazetteer, called NoorGazet, which is used as a feature. We propose a new approach to extracting named entities (NEs), including person, location, organization, and time. It is a hybrid approach that combines the advantages of rule-based and machine learning approaches. We divided NoorCorp into training and test sets containing 80% and 20% of the data, respectively. The prediction model, based on boosting, was developed in two steps: AdaBoost.M1 identifies NEs and AdaBoost.M2 classifies them. Many methods use multiple classifiers as voters and combine their results; among them, ensemble methods generate multiple hypotheses using the same base learner. We built an ensemble of 50 members (classifiers), using a decision stump as the weak learner. Since only 17% of the text carries named entity labels, we had to deepen the tree while restricting pruning.
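The identification step described above can be illustrated with a minimal sketch of AdaBoost.M1 over decision stumps. This is not the authors' implementation: the three binary features, the toy tokens, and the helper names `train_stump`, `adaboost_m1` and `predict` are assumptions made for illustration. Only the ensemble size of 50 and the stump weak learner come from the abstract, and the second step (AdaBoost.M2 classification into person, location, organization and time) is not shown.

```python
import math
from collections import defaultdict


def train_stump(X, y, w):
    """Fit a one-feature decision stump on binary features under sample weights w.

    Returns (feature_index, leaf_labels, weighted_error).
    """
    # Fallback label for a side of the split that receives no samples.
    overall = defaultdict(float)
    for yi, wi in zip(y, w):
        overall[yi] += wi
    default = max(overall, key=overall.get)

    best = None
    for j in range(len(X[0])):
        # Weighted label mass on each side of the split (feature value 0 or 1).
        mass = {0: defaultdict(float), 1: defaultdict(float)}
        for xi, yi, wi in zip(X, y, w):
            mass[xi[j]][yi] += wi
        # Each side predicts its weighted-majority label.
        leaf = {v: (max(mass[v], key=mass[v].get) if mass[v] else default)
                for v in (0, 1)}
        err = sum(wi for xi, yi, wi in zip(X, y, w) if leaf[xi[j]] != yi)
        if best is None or err < best[2]:
            best = (j, leaf, err)
    return best


def adaboost_m1(X, y, rounds=50):
    """AdaBoost.M1: reweight samples each round and collect weighted stumps."""
    m = len(X)
    w = [1.0 / m] * m
    ensemble = []  # entries: (vote_weight, feature_index, leaf_labels)
    for _ in range(rounds):
        j, leaf, err = train_stump(X, y, w)
        if err >= 0.5:  # M1 needs a weak learner below 50% weighted error
            break
        perfect = err == 0.0
        err = max(err, 1e-10)  # avoid log(1/0) for a perfect stump
        beta = err / (1.0 - err)
        ensemble.append((math.log(1.0 / beta), j, leaf))
        if perfect:
            break
        # Shrink the weight of correctly classified samples, then renormalise.
        w = [wi * beta if leaf[xi[j]] == yi else wi
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Return the label with the largest total vote weight."""
    votes = defaultdict(float)
    for alpha, j, leaf in ensemble:
        votes[leaf[x[j]]] += alpha
    return max(votes, key=votes.get) if votes else None


# Hypothetical binary features per token (assumed for this sketch):
# [in_gazetteer, follows_name_trigger, tagged_proper_noun]
X = [[1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 1],
     [1, 0, 0], [0, 0, 0], [0, 1, 0], [1, 1, 0]]
y = ["NE", "NE", "NE", "O", "O", "O", "O", "O"]

ensemble = adaboost_m1(X, y, rounds=50)
predictions = [predict(ensemble, x) for x in X]
print(len(ensemble), predictions)
```

Each stump votes with weight log(1/beta), so members with lower weighted error count for more, and samples the current members misclassify gain relative weight in the next round. A stump is a tree of depth one; the deeper, lightly pruned trees mentioned in the abstract would replace `train_stump` with a multi-level learner.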
We used tokenization, part-of-speech (POS) tagging, and base phrase chunking (BPC) to overcome linguistic obstacles in Arabic, including semantic ambiguity, optional diacritics, complex morphology, and nonstandard written text. Moreover, the most frequently used words were extracted as keywords using a statistical technique. Results show that the method performs better than a decision tree used as the base classifier. It obtained an overall F-measure of 86.85, about 20% better than the baseline and about 12% better than a CART decision tree. Since the CA corpus consists of simpler linguistic patterns than MSA, we also applied the proposed approach to ANERCorp, an MSA corpus. Results show that the proposed model performs about 19% better on the CA corpus than on the MSA corpus. This is because many NEs have entered MSA from other languages; these proper names follow no specific patterns and do not appear in the gazetteer. In addition, many NEs are not uniformly distributed in ANERCorp, which considerably reduces accuracy.
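For readers unfamiliar with the F-measure reported above, the following sketch shows how precision, recall and their harmonic mean are commonly computed for NER. The abstract does not state whether scoring is per token or per entity, so exact matching of (start, end, type) spans is an assumption here, and the gold and predicted spans are invented toy data, not results from the paper.

```python
def f_measure(gold, predicted, beta=1.0):
    """Precision, recall and F-beta over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # exact span and type match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Toy annotation: four gold entities, three predictions, two exact matches.
gold = {(0, 1, "PER"), (4, 5, "LOC"), (7, 9, "TIME"), (12, 13, "ORG")}
pred = {(0, 1, "PER"), (4, 5, "LOC"), (7, 8, "TIME")}

p, r, f = f_measure(gold, pred)
print(f"P={p:.2%} R={r:.2%} F1={f:.2%}")
```

With beta set to 1 the score weights precision and recall equally. A prediction with the wrong boundary, such as the TIME span in the toy data, counts against both precision and recall under exact matching.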
  
 

Type of Study: Research | Subject: Paper
Received: 2014/12/1 | Accepted: 2017/03/24 | Published: 2017/10/21 | ePublished: 2017/10/21




Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2015 All Rights Reserved | Signal and Data Processing