A’laam Corpus: A Standard Corpus of Named Entity for Persian Language

Hosseinnejad, Shadi; Shekofteh, Yasser; Emami Azadi, Tahereh

doi:10.29252/jsdp.14.3.127

Volume 14, Issue 3 (12-2017) JSDP 2017, 14(3): 127-142

Hosseinnejad S, Shekofteh Y, Emami Azadi T. A’laam Corpus: A Standard Corpus of Named Entity for Persian Language. JSDP 2017; 14(3): 127-142
URL: http://jsdp.rcisp.ac.ir/article-1-477-en.html

Shadi Hosseinnejad, Yasser Shekofteh*, Tahereh Emami Azadi

Abstract:

Named entity recognition (NER) is a natural language processing (NLP) task used mainly in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. An NER system determines the boundaries of each named entity and classifies it into one of a set of predefined categories. These categories include the names of persons, organizations, and locations (e.g. cities and countries), as well as expressions of time, quantities, monetary expressions, and percentages. In general, corpus-based approaches have proved well suited to the NER problem. Given an NER corpus, named entities can be recognized with rule-based or machine-learning methods.
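The two sub-tasks named above, finding entity boundaries and assigning a type, can be illustrated with a minimal sketch. It assumes a BIO tagging scheme (B- opens an entity, I- continues it, O marks non-entity tokens); the abstract does not state which scheme the corpus uses, and the tag names PER and LOC and the example sentence are made up for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into (start, end, type) entity spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    current: Optional[str] = None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and current == tag[2:]
        if not continues and current is not None:
            spans.append((start, i, current))  # boundary found: close the span
            start, current = None, None
        # "B-" always opens a span; a stray "I-" is treated leniently as "B-".
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return spans


# Hypothetical example sentence and tags (not taken from the corpus).
tokens = ["Ali", "Daei", "visited", "Tehran", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
for start, end, kind in bio_to_spans(tags):
    print(kind, " ".join(tokens[start:end]))
```

Each span carries both the boundary (start and end positions) and the category, which is the unit a NER system is expected to output.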
Corpus-based NER systems need standard, appropriately annotated corpora. However, such corpora exist mainly for languages such as English; for Persian (Farsi) they are rare or limited in size. This paper therefore describes the procedure used to produce a standard named entity (NE) corpus for the Persian language, the A’laam corpus. The A’laam corpus contains about 250,000 tokens tagged with 13 NE tags and was developed at the Research Center for Development of Advanced Technologies (RCDAT). Its tokens are drawn from the Farsi Text Corpus, a standard Farsi corpus of more than 100 million words developed by the Research Center of Intelligent Signal Processing (renamed the Research Center for Development of Advanced Technologies in 2013). The words of the Farsi Text Corpus, selected from diverse written and spoken sources, were tokenized and corrected manually. In addition, an 8-million-word part of the Farsi Text Corpus has part-of-speech (POS) tags at the word level. In total, about 8,400 sentences of the Farsi Text Corpus were randomly selected to obtain the roughly 250,000 tokens of the A’laam corpus. The resulting corpus includes words, POS tags, and named entity tags.
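The corpus layout described above (each token carrying a word, a POS tag, and an NE tag, with sentences drawn at random from a larger pool) might be sketched as follows. The `Token` record, the function names, and the toy data are all hypothetical; the abstract does not describe the actual file format or sampling procedure.

```python
import random
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str  # surface form
    pos: str   # part-of-speech tag
    ne: str    # named entity tag


Sentence = List[Token]


def sample_sentences(pool: List[Sentence], k: int, seed: int = 0) -> List[Sentence]:
    """Draw k sentences uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(pool, k)


def token_count(sentences: List[Sentence]) -> int:
    """Total number of tokens across the given sentences."""
    return sum(len(sentence) for sentence in sentences)


# Toy pool standing in for the source corpus: 10 sentences of 3 tokens each.
pool = [[Token(f"w{i}_{j}", "N", "O") for j in range(3)] for i in range(10)]
selected = sample_sentences(pool, k=4)
print(len(selected), token_count(selected))
```

As a rough consistency check on the figures quoted above, about 250,000 tokens over about 8,400 sentences corresponds to roughly 30 tokens per sentence on average.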
To evaluate the A’laam corpus, a Persian NER system was trained on it. The corpus was divided into training and test sections: the training section accounted for 90% of the corpus and the remaining 10% formed the test section. Using the Conditional Random Fields (CRF) method, the Persian NER system achieved a precision of 92.94% and a recall of 78.48%.
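The evaluation protocol can be sketched in a few lines, under stated assumptions: the split is made at random over sentences, and precision and recall are computed over exactly matching entity spans. The abstract specifies neither the unit of the split nor whether scoring is per token or per entity, so both choices, and all names below, are illustrative only. The CRF model itself is not shown.

```python
import random
from typing import List, Sequence, Set, Tuple, TypeVar

T = TypeVar("T")
Entity = Tuple[int, int, int, str]  # (sentence id, start, end, type)


def split_90_10(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the items and return (90% training part, 10% test part)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def precision_recall(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float]:
    """Exact-match precision and recall over sets of entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Made-up gold and predicted entities for two sentences.
gold = {(0, 0, 2, "PER"), (0, 3, 4, "LOC"), (1, 1, 2, "ORG"), (1, 4, 5, "LOC")}
predicted = {(0, 0, 2, "PER"), (0, 3, 4, "LOC"), (1, 1, 3, "ORG")}
p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
```

The abstract reports only precision and recall. Their harmonic mean, `f1(0.9294, 0.7848)`, works out to about 85.1%; this is a figure derived here, not one stated in the abstract.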

Keywords: Natural Language Processing, Named Entity Recognition, Named Entity Corpus, Machine Learning, Conditional Random Field

Type of Study: Applicable | Subject: Paper
Received: 2016/01/17 | Accepted: 2017/03/05 | Published: 2018/01/29 | ePublished: 2018/01/29

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing