Corpus based coreference resolution for Farsi text

Rahimi, Zeinab; HosseinNejad, Shadi

doi:10.29252/jsdp.17.1.79

Volume 17, Issue 1 (6-2020) JSDP 2020, 17(1): 79-98

Zeinab Rahimi*, Shadi HosseinNejad

Research Center for Development of Advanced Technology (RCDAT)

Abstract:

"Coreference resolution", the task of finding all expressions in a text that refer to the same entity, is an important requirement in natural language processing. Two expressions are coreferent when both refer to the same entity in the text or in the real world, so the main task of a coreference resolution system is to identify the expressions that refer to one entity. A coreference resolution tool can be used in many natural language processing tasks, such as machine translation, automatic text summarization, question answering, and information extraction. Adding coreference information can increase the power of natural language processing systems.
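As an illustration of the definition above, coreference annotation can be pictured as grouping mention spans into chains, one chain per entity. The following Python sketch is not taken from the article: the English sentence, the `Mention` class, the chain identifiers and the `coreferent_pairs` helper are all hypothetical and serve only to show the idea.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Mention:
    """A mention is a token span; both indices are inclusive (illustrative type)."""
    start: int
    end: int
    text: str


# Invented example sentence, split into tokens.
tokens = "Sara said she would bring her book".split()

sara = Mention(0, 0, "Sara")
she = Mention(2, 2, "she")
her = Mention(5, 5, "her")
her_book = Mention(5, 6, "her book")

# One chain per entity: every mention in a chain refers to the same entity.
chains: Dict[str, List[Mention]] = {
    "E1": [sara, she, her],  # the person
    "E2": [her_book],        # the book (a singleton chain)
}


def coreferent_pairs(chains: Dict[str, List[Mention]]) -> List[Tuple[Mention, Mention]]:
    """List every pair of mentions that share a chain."""
    pairs: List[Tuple[Mention, Mention]] = []
    for chain in chains.values():
        pairs.extend(combinations(chain, 2))
    return pairs


for left, right in coreferent_pairs(chains):
    print(f"{left.text!r} and {right.text!r} are coreferent")
```

Storing chains rather than pairs keeps the annotation compact, since the coreferent pairs can always be derived from the chains when a pairwise view is needed.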
Coreference resolution can be done in different ways, including heuristic rule-based methods and supervised or unsupervised machine learning methods. Corpus-based and machine-learning-based methods have been widely used for coreference resolution in recent years and have led to good performance. Such methods require a manually labeled corpus of sufficient size. Before this research, no such corpus existed for the Persian language. One important goal here was therefore to produce a thorough corpus that can be used in coreference resolution and in related fields of linguistics and computational linguistics.
In this research, a manually annotated corpus of coreference-tagged phrases containing about one million words has been produced. The corpus also carries named entity recognition (NER) tags with 7 named entity labels, and for the coreference task all noun phrases, pronouns and named entities have been tagged. Using this corpus, a coreference tool was built with a vector space machine, reaching a precision of about 60% on gold test data.
This article presents the procedure for producing the coreference resolution tool. The tool is built with a machine learning method and is based on the tagged corpus of 900 thousand tokens. Several different features and tools were used in building the system, each of which affects the accuracy of the whole tool. Increasing the number of features, especially semantic features, can be effective in improving results. Currently, no suitable syntactic and semantic tools are available for the Persian language, and this research suffers from that limitation.
The coreference-tagged corpus produced in this study is more than 500 times larger than previous Persian corpora, and it is comparable to the prominent ACE and OntoNotes corpora.
The resulting system has an F-measure of nearly 60 according to the standard CoNLL criterion. Other, more limited studies on Farsi have reported accuracies ranging from 40 to 90%, but these are not comparable to the present study because they were not measured with the standard criteria of the coreference resolution field.
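For readers unfamiliar with the CoNLL criterion mentioned above: the CoNLL-2012 shared task score is the unweighted mean of the F-measures of three metrics, MUC, B-cubed and entity-based CEAF. The sketch below is a simplified illustration, not the official scorer and not the authors' evaluation code. It implements MUC and B-cubed on toy chains (the mention names and chains are invented), treats a mention missing from the other partition as a singleton (an assumption), and leaves CEAF out, so `conll_average` takes that value as an input.

```python
from typing import Dict, Hashable, List, Set

Chain = Set[Hashable]  # a non-empty set of mention identifiers


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def muc_recall(key: List[Chain], response: List[Chain]) -> float:
    """Link-based MUC recall; swap the arguments to get precision."""
    owner: Dict[Hashable, int] = {m: i for i, chain in enumerate(response) for m in chain}
    numerator = denominator = 0
    for chain in key:
        # Number of pieces the key chain is split into by the response.
        parts = {owner[m] for m in chain if m in owner}
        missing = sum(1 for m in chain if m not in owner)  # assumed singletons
        numerator += len(chain) - (len(parts) + missing)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def b_cubed_recall(key: List[Chain], response: List[Chain]) -> float:
    """Mention-based B-cubed recall; swap the arguments to get precision."""
    owner: Dict[Hashable, Chain] = {m: chain for chain in response for m in chain}
    total, count = 0.0, 0
    for chain in key:
        for m in chain:
            other = owner.get(m, {m})  # assumed singleton if m is missing
            total += len(chain & other) / len(chain)
            count += 1
    return total / count if count else 0.0


def conll_average(muc_f1: float, b_cubed_f1: float, ceaf_e_f1: float) -> float:
    """Unweighted mean of the three F-measures."""
    return (muc_f1 + b_cubed_f1 + ceaf_e_f1) / 3


# Toy gold (key) and system (response) chains over mentions a..e.
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]

muc = f1(muc_recall(response, key), muc_recall(key, response))
b3 = f1(b_cubed_recall(response, key), b_cubed_recall(key, response))
print(f"MUC F1 = {muc:.3f}, B-cubed F1 = {b3:.3f}")
```

Because the three metrics weigh links, mentions and entities differently, a single accuracy figure computed some other way cannot be compared directly with a CoNLL score, which is the point the paragraph above makes about earlier Farsi studies.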

Keywords: Automatic coreference resolution, Anaphora resolution, mention

Type of Study: Research | Subject: Paper
Received: 2018/06/9 | Accepted: 2019/06/1 | Published: 2020/06/21 | ePublished: 2020/06/21

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.