A'laam Corpus: A Standard Corpus of Named Entities for the Persian Language

Hosseinnejad, Shadi; Shekofteh, Yaser; Emami Azadi, Tahereh

doi:10.29252/jsdp.14.3.127

Signal and Data Processing Journal: a scientific journal officially licensed by the Commission for Scientific Publications of the Ministry of Science, Research and Technology (MSRT). Publisher: Research Center for Development of Advanced Technologies


Volume 14, Issue 3 (9-1396), pages 127-142

Hosseinnejad S, Shekofteh Y, Emami Azadi T. A’laam Corpus: A Standard Corpus of Named Entity for Persian Language. JSDP 2017; 14(3): 127-142
URL: http://jsdp.rcisp.ac.ir/article-1-477-fa.html



Shadi Hosseinnejad, Yaser Shekofteh^*, Tahereh Emami Azadi

Faculty of Computer Science and Engineering, Shahid Beheshti University

Abstract:

Named entity recognition is one of the established problems in natural language processing. Its main applications are in text summarization, information extraction, question answering, machine translation, and document classification systems. One way to build a named entity recognition system is to use corpus-based methods. This paper describes how the A'laam corpus, a standard corpus annotated with named entity tags for Persian, was built, and the stages involved. With thirteen named entity tags and a size of 250 thousand words, the corpus meets the needs of automatic tagging systems in Persian natural language processing. Using this corpus and the conditional random field machine learning method, a system for recognizing named entities in Persian sentences was built, with a precision of 94/92 percent and a recall of 48/78 percent.

Keywords: natural language processing, named entity recognition, named entity corpus, machine learning, conditional random field
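To make the two reported metrics concrete, the following is a minimal sketch of how precision and recall can be computed for a named entity tagger. It rests on assumptions that the abstract does not state: that tags follow a BIO scheme, that scoring is done on exact entity spans, and that labels such as `PER`, `LOC`, and `ORG` stand in for the corpus's thirteen tags. The function names are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray "I-" tag is treated as the start of a new span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def precision_recall(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Exact-match, entity-level precision and recall."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical six-token sentence: the tagger misses the LOC entity.
    gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
    pred = ["B-PER", "I-PER", "O", "O", "O", "B-ORG"]
    p, r = precision_recall(gold, pred)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.67
```

In this toy case every predicted entity is correct but one gold entity is missed, so precision exceeds recall, which is the same direction as the gap reported in the abstract.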


Type of study: Applied | Subject: Text processing
Received: 1394/10/27 | Accepted: 1395/12/15 | Published: 1396/11/9 | Published online: 1396/11/9



Redistribution
	This article may be redistributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License.