Signal and Data Processing

A’laam Corpus: A Standard Corpus of Named Entities for the Persian Language

Section: Text Processing Papers
Article type: Applied

Abstract (translated from Persian): Named entity recognition is one of the prominent problems in natural language processing. Its main applications are in text summarization, information extraction, question answering, machine translation, and document classification systems. One way to build a named entity recognition system is to use corpus-based methods. This paper describes the method and stages of building the A’laam corpus, a standard corpus annotated with named entity tags for the Persian language. With thirteen named entity tags and a size of 250 thousand words, the corpus meets the needs of automatic tagging systems in Persian natural language processing. Using this corpus and the conditional random field machine learning method, a system for recognizing named entities in Persian sentences was built, with a precision of 92.94 percent and a recall of 78.48 percent.

Abstract: Named entity recognition (NER) is a natural language processing (NLP) problem whose main applications are text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named entities include the names of persons, organizations, and locations (e.g. city and country), as well as expressions of time, quantities, monetary expressions, and percentages. In general, corpus-based approaches have proven well suited to the NER problem. Given a NER corpus, named entities can be recognized with rule-based or machine-learning methods.

Corpus-based NER systems need standard, appropriately annotated corpora. However, such corpora exist mainly for languages such as English; for Persian/Farsi they are rare or limited in volume.
This paper therefore describes how a standard named entity (NE) corpus for the Persian language, the A’laam corpus, was produced. The A’laam corpus contains about 250,000 tokens tagged with 13 NE tags. It was developed at the Research Center for Development of Advanced Technologies (RCDAT). Its tokens are drawn from the Farsi Text Corpus, a standard Farsi corpus of more than 100 million words developed by the Research Center of Intelligent Signal Processing (renamed the Research Center for Development of Advanced Technologies in 2013). The words of the Farsi Text Corpus, selected from diverse written and spoken sources, were tokenized and corrected manually. In addition, an 8-million-word part of the Farsi Text Corpus has part-of-speech (POS) tags at the word level. In total, about 8,400 sentences were randomly selected from the Farsi Text Corpus to obtain the roughly 250,000 tokens of the A’laam corpus. The A’laam corpus includes words, POS tags, and named entity tags.

To evaluate the A’laam corpus, a Persian NER system was trained on it. The corpus was divided into a training section and a test section: the training section accounts for 90% of the corpus and the test section for the remaining 10%. Using the Conditional Random Fields (CRF) method, the Persian NER system achieved a precision of 92.94% and a recall of 78.48%.
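The evaluation protocol described above (a 90%/10% split followed by precision and recall) can be illustrated with a minimal Python sketch. Several details here are assumptions, since the abstract does not specify them: that sentences are shuffled before splitting, that tags follow a BIO scheme, that scoring is done on whole entities rather than single tokens, and the tag names `PER` and `LOC` (the 13 A’laam tags are not listed in the abstract). The function names are hypothetical and the CRF model itself is not shown.

```python
import random
from typing import List, Sequence, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def split_sentences(sentences: Sequence, train_fraction: float = 0.9, seed: int = 0):
    """Shuffle sentences and split them into train and test parts (assumed protocol)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # "B-" always opens a span; a stray "I-" with no open span is treated as an opening
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def precision_recall(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float]:
    """Entity-level precision and recall: a span counts only if boundaries and type match."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall


# Toy example with hypothetical tags: the prediction misses the location entity.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(precision_recall(gold, pred))  # (1.0, 0.5)
```

In the toy example every predicted entity is correct but one gold entity is missed, which gives high precision with lower recall, the same qualitative pattern as the reported 92.94% precision and 78.48% recall.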
Keywords: Natural Language Processing, Named Entity Recognition, Named Entity Corpus, Machine Learning, Conditional Random Field

Pages: 127-142
URL: http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-306-3&slc_lang=fa&sid=1

Authors:
Shadi Hosseinnejad (hosseinnjad@rcdat.ir), Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies
Yasser Shekofteh (shekofteh@rcdat.ir, corresponding author), Faculty of Computer Science and Engineering, Shahid Beheshti University
Tahereh Emami Azadi (t.emami@rcdat.ir), Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies