Phrase chunking in Persian texts

Salimibadr, Armin; Homayounpour, Mohammad Mehdi

Signal and Data Processing Journal. A scientific journal officially licensed by the Commission for Scientific Publications of the MSRT. Publisher: Research Center for Development of Technologies


Volume 10, Issue 2 (3-2014) JSDP 2014, 10(2): 69-86


Salimibadr A, Homayounpour M M. Phrase chunking in Persian texts. JSDP 2014; 10(2): 69-86
URL: http://jsdp.rcisp.ac.ir/article-1-73-en.html


Armin Salimibadr*, Mohammad Mehdi Homayounpour

Abstract:

Text tokenization is the process of segmenting text into meaningful units such as words, phrases, and sentences. Segmenting text into syntactic phrases, known as chunking, is an important preprocessing step in many applications such as machine translation, information retrieval, and text-to-speech. In this paper, Farsi texts are chunked using statistical and machine learning methods together with the grammatical characteristics of Farsi. Many features and labeling methods are examined one by one, and the best-performing ones are used to detect syntactic phrases and their boundaries. Several machine learning techniques, including Support Vector Machines and Conditional Random Fields, are used as classifiers in our experiments. The impact of training set size on chunking performance was studied as well. Using the proposed methods, a performance of 84.02% was obtained for detecting phrase boundaries and 78.04% for detecting both phrase boundaries and phrase types.
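The distinction between scoring phrase boundaries alone and scoring boundaries together with phrase type can be illustrated with a minimal sketch. It assumes an IOB2 labeling scheme (B-X, I-X, O) and an F1 metric; the abstract names neither, so both are assumptions, as are the function names `iob2_to_chunks` and `f1` and the toy tag sequences.

```python
from typing import List, Tuple

# (start index, end index exclusive, phrase type); hypothetical representation
Chunk = Tuple[int, int, str]


def iob2_to_chunks(tags: List[str]) -> List[Chunk]:
    """Convert IOB2 tags (B-X, I-X, O) into (start, end, type) spans."""
    chunks: List[Chunk] = []
    start, ctype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes the last chunk
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and start is not None and label == ctype
        if not continues:
            if start is not None:
                chunks.append((start, i, ctype))
                start, ctype = None, None
            if prefix in ("B", "I"):  # a stray I- tag is treated as opening a chunk
                start, ctype = i, label
    return chunks


def f1(gold: List[Chunk], pred: List[Chunk], typed: bool = True) -> float:
    """F1 over chunks; with typed=False only the boundaries must match."""
    key = (lambda c: c) if typed else (lambda c: c[:2])
    g = {key(c) for c in gold}
    p = {key(c) for c in pred}
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Toy example (invented tags, not data from the paper): the prediction finds
# every boundary but mislabels the second phrase as NP instead of VP.
gold_tags = ["B-NP", "I-NP", "B-VP", "O", "B-NP"]
pred_tags = ["B-NP", "I-NP", "B-NP", "O", "B-NP"]
gold = iob2_to_chunks(gold_tags)
pred = iob2_to_chunks(pred_tags)
boundary_f1 = f1(gold, pred, typed=False)
typed_f1 = f1(gold, pred, typed=True)
print(boundary_f1, typed_f1)
```

In this toy case the boundary-only score is higher than the typed score, which matches the direction of the two figures reported in the abstract: a chunk counted as correct for boundaries can still be wrong on phrase type.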

Keywords: Natural language processing, Phrase chunking, POS tagging, Support vector machine, Conditional random fields, Text to speech, Machine translation


Type of Study: Research | Subject: Paper
Received: 2013/06/05 | Accepted: 2013/09/10 | Published: 2014/04/08 | ePublished: 2014/04/08


Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.