Volume 16, Issue 1 (5-2019) | JSDP 2019, 16(1): 91-110



Shahshahani M S, Mohseni M, Shakery A, Faili H. PAYMA: A Tagged Corpus of Persian Named Entities. JSDP. 2019; 16(1): 91-110
URL: http://jsdp.rcisp.ac.ir/article-1-769-en.html
College of Engineering, University of Tehran
Abstract:
The goal of the named entity recognition (NER) task is to classify the proper nouns in a piece of text into classes such as person, location, and organization. NER is an important preprocessing step in many natural language processing tasks, such as question answering and summarization. Although many studies have addressed this task in English, and state-of-the-art NER systems have reached F1 scores above 90 percent, very few studies address it in Persian. One of the main reasons may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research we create a standard tagged Persian NER dataset, which will be distributed freely for research purposes. To construct this dataset, we studied the existing standard NER datasets in English and concluded that almost all of them are built from news data. We therefore collected documents from ten Persian news websites. Next, to provide the annotators with tagging guidelines, we studied the guidelines used to construct the CoNLL and MUC English datasets and created our own, taking Persian linguistic rules into account. Under these guidelines, every word in a document is labeled as person, location, organization, time, date, percent, currency, or other (words that belong to none of these seven classes). Like most existing English NER datasets, we use IOB encoding to annotate named entities: the first token of a named entity is labeled B, any following tokens of the same entity are labeled I, and words that are not part of any named entity are labeled O. The resulting corpus, named PAYMA, consists of 709 documents containing 302,530 tokens, of which 41,148 are labeled as named entities and the rest are labeled O.
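The IOB encoding described above can be illustrated with a minimal Python sketch. The helper name `iob_encode`, the span format, the English example sentence, and the tag strings such as `B-PER` are assumptions made for illustration; the abstract only states that the first token of an entity is labeled B, later tokens I, and all other words O.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, entity_type) over token indices.
Span = Tuple[int, int, str]


def iob_encode(num_tokens: int, spans: List[Span]) -> List[str]:
    """Turn entity spans into one IOB label per token."""
    labels = ["O"] * num_tokens  # words outside any named entity stay O
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"  # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"  # remaining tokens, if any
    return labels


# Illustrative sentence, not taken from the corpus.
tokens = ["Ali", "Daei", "visited", "Tehran", "yesterday"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
labels = iob_encode(len(tokens), spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Writing the class name after the B or I prefix is one common convention; the exact tag strings used in the corpus may differ.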
To measure inter-annotator agreement, 160 documents were labeled by a second annotator. The kappa statistic, computed over the words labeled as named entities, was estimated at 95%. We then used the dataset to design a hybrid system for named entity recognition. We trained a statistical system based on the CRF algorithm and used its output as a feature for training a bidirectional LSTM recurrent neural network. In addition, we clustered the words with the k-means method and fed each word's cluster number to the LSTM network. This combination of CRF with a neural network, together with the use of a cluster number for each word, is the novelty of this work. Experimental results show that the final model reaches an F1 score of 87% at the word level and 80% at the phrase level.
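The abstract does not spell out how the word-level and phrase-level scores are computed. The sketch below shows one common reading, stated here as an assumption: at the word level each entity token is scored on its own, while at the phrase level an entity counts as correct only if its boundaries and type both match exactly. All function names and the toy label sequences are hypothetical.

```python
from typing import List, Set, Tuple


def extract_phrases(labels: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) for every B/I run of labels."""
    phrases = set()
    start, etype = None, None
    for i, label in enumerate(labels + ["O"]):  # trailing O closes the last run
        continues = (
            start is not None and label.startswith("I-") and label[2:] == etype
        )
        if start is not None and not continues:
            phrases.add((start, i, etype))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
        # An I- label with no open entity of that type is ignored (a design choice).
    return phrases


def f1(true_pos: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing is correct."""
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def word_level_f1(gold: List[str], pred: List[str]) -> float:
    """Score each non-O token separately; the full label must match."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    return f1(true_pos,
              sum(1 for p in pred if p != "O"),
              sum(1 for g in gold if g != "O"))


def phrase_level_f1(gold: List[str], pred: List[str]) -> float:
    """Score whole entities; boundaries and type must match exactly."""
    g, p = extract_phrases(gold), extract_phrases(pred)
    return f1(len(g & p), len(p), len(g))


gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "O"]  # second token of the person is missed
word_score = word_level_f1(gold, pred)
phrase_score = phrase_level_f1(gold, pred)
print(word_score, phrase_score)
```

In this toy example a single missed token costs one whole entity at the phrase level but only one token at the word level, which is why phrase-level scores are typically the lower of the two.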
Type of Study: Applicable | Subject: Paper
Received: 2017/12/16 | Accepted: 2019/02/24 | Published: 2019/06/10 | ePublished: 2019/06/10



© 2015 All Rights Reserved | Signal and Data Processing