Volume 12, Issue 2 (9-2015)                   JSDP 2015, 12(2): 55-72

Rahimi Z, Samani M H, Khadivi S. Extracting parallel corpora from web comparable documents to improve the quality of an English-Farsi translation system. JSDP 2015; 12(2):55-72
URL: http://jsdp.rcisp.ac.ir/article-1-190-en.html
RCISP
Abstract:
Data for training statistical machine translation systems are usually drawn from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are the ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used as sources for parallel text extraction. Most existing methods for exploiting comparable corpora look for parallel data at the sentence level. However, we believe that highly non-parallel corpora contain few or no good sentence pairs, and that most of their parallel data exists at the sub-sentential level. The base system is the Munteanu 2006 fragment extraction system, implemented in C#. The proposed system extracts fragment blocks from related input sentences using a score calculated from features such as fragment length, LLR score, relevance path specification in the block, and translation coverage percentage. Evaluations indicate that the proposed method outperforms both the base system and the improved base system.
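The abstract names an LLR score as one of the block features but does not give its formula. As a minimal sketch, assuming LLR here refers to the standard log-likelihood ratio statistic over a 2x2 word co-occurrence table (the function name `llr` and the count arguments are hypothetical and not taken from the paper), it could be computed as follows:

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 co-occurrence table.

    Assumed meaning of the counts (illustrative, not from the paper):
      k11: sentence pairs containing both the source and the target word
      k12: pairs containing the source word only
      k21: pairs containing the target word only
      k22: pairs containing neither word
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)   # row marginals
    cols = (k11 + k21, k12 + k22)   # column marginals
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # an empty cell contributes nothing to the sum
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total
```

Under this assumption, a table in which the two words occur independently scores zero, and larger values indicate a stronger association between the word pair. How the paper combines this value with the other features is not stated in the abstract.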
Type of Study: Research | Subject: Paper
Received: 2013/12/12 | Accepted: 2014/08/25 | Published: 2015/09/30 | ePublished: 2015/09/30

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2015 All Rights Reserved | Signal and Data Processing