Volume 14, Issue 4 (3-2018)                   JSDP 2018, 14(4): 43-54

Improving Named Entity Recognition Using Izafe in Farsi. JSDP. 2018; 14 (4) :43-54
URL: http://jsdp.rcisp.ac.ir/article-1-495-en.html
Abstract:

Named entity recognition is the process of identifying people's names, place names (cities, countries, seas, etc.), organization names (public and private companies, international institutions, etc.), dates, currencies, and percentages in a text. It plays an important role in many NLP tasks, such as semantic role labeling, question answering, summarization, machine translation, semantic search, relation extraction, and quotation recognition. Named entity recognition is considerably more difficult in Persian than in English. In English text, proper nouns usually begin with a capital letter, which makes named entities easier to identify; Persian text has no such feature. Three kinds of methods are generally used to build a named entity recognition system: rule-based, machine-learning-based, and hybrid. Each has its own advantages and disadvantages. The greatest challenge for Persian text is the lack of data labeled with named entities. Because of this, rule-based methods are usually used to extract entities.
In this paper, dictionaries of organizations, places, and people were first extracted from Wikipedia. Wikipedia is one of the best sources for extracting entities; it is known to contain more than 200,000 Farsi named entities. The proposed algorithm classifies each Wikipedia article title using its categories: each title has several categories, which can be used to partially identify the named entity type. Rules were then applied to increase recognition accuracy (precision). These rules fall into three categories: morphological rules, adjacency rules, and text patterns. The most important are the adjacency rules, which identify the type of an entity from the words next to it (such as Mr. or Mrs.). To evaluate the system, 42,000 tokens of the BijanKhan corpus were manually annotated (labeled). The initial F-measure was 78.79 percent. Recognition accuracy (precision) was then improved using izāfe, one of the important features of the Persian language, and an F-measure of 81.94 percent was achieved. The results showed that using izāfe in named entity recognition systems significantly increases their accuracy.
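The two dictionary-and-rule steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the category keywords, the romanized cue words ("aghaye" for Mr., "khanom" for Mrs., "shahr" for city, "sherkat" for company), and the function names `classify_title` and `tag_tokens` are all hypothetical, chosen only to show how category voting and an adjacency rule might fit together. The izāfe step is not modeled, since the abstract does not describe how it is applied.

```python
from collections import Counter

# Hypothetical keyword lists: the abstract does not give the actual
# mapping from Wikipedia categories to entity types.
CATEGORY_KEYWORDS = {
    "PERSON": {"births", "deaths", "people", "poets"},
    "LOCATION": {"cities", "countries", "provinces", "seas"},
    "ORGANIZATION": {"companies", "universities", "organizations"},
}

# Hypothetical romanized cue words that precede an entity.
ADJACENCY_CUES = {
    "aghaye": "PERSON",         # Mr.
    "khanom": "PERSON",         # Mrs.
    "shahr": "LOCATION",        # city
    "sherkat": "ORGANIZATION",  # company
}


def classify_title(categories):
    """Guess an entity type for an article title from its category names.

    Each category votes for every type whose keyword appears in it as a
    whole word; the type with the most votes wins. Returns None when no
    category matches.
    """
    votes = Counter()
    for category in categories:
        words = set(category.lower().split())
        for entity_type, keywords in CATEGORY_KEYWORDS.items():
            if words & keywords:
                votes[entity_type] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def tag_tokens(tokens, gazetteer):
    """Tag tokens by dictionary lookup, then apply the adjacency rule.

    A cue word immediately before a token overrides the dictionary tag,
    unless the token is itself a cue word.
    """
    tags = []
    for i, token in enumerate(tokens):
        tag = gazetteer.get(token, "O")
        if i > 0:
            cue_type = ADJACENCY_CUES.get(tokens[i - 1].lower())
            if cue_type is not None and token.lower() not in ADJACENCY_CUES:
                tag = cue_type
        tags.append(tag)
    return tags


# Build a one-entry dictionary from made-up category names.
gazetteer = {
    "Shiraz": classify_title(["Cities in Fars Province", "Populated places in Iran"]),
}
sentence = ["aghaye", "Ahmadi", "dar", "Shiraz", "zendegi", "mikonad"]
tags = tag_tokens(sentence, gazetteer)
print(list(zip(sentence, tags)))
```

In this sketch "Ahmadi" is absent from the dictionary and receives its PERSON tag only from the preceding cue word, which is the kind of case the adjacency rules are meant to cover.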

Type of Study: Research | Subject: Paper
Received: 2016/03/1 | Accepted: 2017/05/5 | Published: 2018/03/13 | ePublished: 2018/03/13



© 2015 All Rights Reserved | Signal and Data Processing