Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

Sadeghzadeh, Mohammadbagher; Razzazi, Mohammadreza; Ghayoomi, Masood

doi:10.29252/jsdp.16.3.36

Volume 16, Issue 3 (12-2019) JSDP 2019, 16(3): 36-23

Sadeghzadeh M, Razzazi M, Ghayoomi M. Studying impressive parameters on the performance of Persian probabilistic context free grammar parser. JSDP 2019; 16(3): 36-23
URL: http://jsdp.rcisp.ac.ir/article-1-385-en.html

Mohammadbagher Sadeghzadeh*, Mohammadreza Razzazi, Masood Ghayoomi

Amirkabir University of Technology

Abstract:

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. Treebank data have been widely exploited ever since the first large-scale treebank, the Penn Treebank, was published. Although treebanks originated in computational linguistics, their value is becoming more widely appreciated in linguistic research as a whole. For example, annotated treebank data have been crucial in syntactic research for testing linguistic theories of sentence structure against large quantities of naturally occurring examples.
A natural language parser consists of two basic parts: a POS tagger and a syntactic parser. A part-of-speech (POS) tagger is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token); computational applications generally use more fine-grained POS tags such as 'noun-plural'. A natural language parser is a program that works out the grammatical structure of sentences, for instance which groups of words go together (as "phrases") and which words are the subject or object of a verb.
Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Poorly designed context-free grammars and unsuitable structures such as Chomsky normal form can reduce the accuracy of a probabilistic context-free grammar (PCFG) parser.
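As a rough illustration of how a probabilistic grammar can be learned from hand-parsed sentences, the sketch below estimates rule probabilities by relative frequency. The tuple encoding of trees, the function names, and the tiny English treebank are assumptions made for this example only; they are not taken from the paper or from the Persian treebank.

```python
from collections import Counter

# Trees are nested tuples: (label, child, child, ...); a leaf is a plain string.
# This toy treebank is invented for illustration and is not the paper's data.
TOY_TREEBANK = [
    ("S", ("NP", ("N", "dogs")), ("VP", ("V", "bark"))),
    ("S", ("NP", ("N", "cats")), ("VP", ("V", "chase"), ("NP", ("N", "dogs")))),
]

def extract_rules(tree):
    """Yield one (lhs, rhs) pair for every internal node of a tuple-encoded tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from extract_rules(child)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule for tree in treebank for rule in extract_rules(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

probs = estimate_pcfg(TOY_TREEBANK)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")
```

In this toy grammar the probabilities of all rules sharing a left-hand side sum to one, which is the defining property of a PCFG.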
The weak independence assumption is one of the problems of context-free grammars (CFGs). We have tried to mitigate this problem with parent and child annotation, which copies the label of a parent node onto the labels of its children and can improve the performance of a PCFG.
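Parent annotation, as described above, can be sketched as a simple tree transformation. The tuple encoding, the `^` separator, and the decision to leave POS-level (preterminal) nodes unannotated are assumptions of this example; the paper's exact annotation scheme, including its child annotation, is not specified in the abstract and is not reproduced here.

```python
def parent_annotate(tree, parent=None):
    """Return a copy of a tuple-encoded tree whose phrasal labels carry the parent's label.

    Trees are (label, child, ...) tuples and leaves are plain strings (assumed encoding).
    """
    if isinstance(tree, str):
        return tree  # a word: nothing to annotate
    label, *children = tree
    # Assumption: preterminals (a POS tag directly above a word) stay unchanged.
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if parent is None or is_preterminal:
        new_label = label
    else:
        new_label = f"{label}^{parent}"
    # Children receive the original (unannotated) label of this node as their parent.
    return (new_label, *(parent_annotate(child, label) for child in children))

example = ("S", ("NP", ("N", "she")), ("VP", ("V", "saw"), ("NP", ("N", "him"))))
annotated = parent_annotate(example)
print(annotated)
```

After the transformation, a subject NP (`NP^S`) and an object NP (`NP^VP`) become distinct non-terminals, so a grammar read off the annotated trees can assign them different expansion probabilities.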
In grammar, a conjunction (conj) is a part of speech that connects words, phrases, or clauses, which are called the conjuncts of the conjunction. In this study, we examined the conjunction phrases in the Persian treebank. The results show that adding structural dependencies to the grammar and modifying the basic rules can remove conjunction ambiguity and increase the accuracy of a probabilistic context-free grammar parser.
When a part-of-speech (POS) tagger assigns word class labels to tokens, it has to select from a set of possible labels whose size usually ranges from fifty to several hundred, depending on the language. In this study, we have investigated the effect of fine- and coarse-grained POS tags and of merging non-terminals on a Persian PCFG parser.
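The difference between fine- and coarse-grained tags can be made concrete with a small sketch that collapses fine-grained tags into their coarse class. The tag format (`N_PL`, `V_PRS`, with `_` as separator) and the function names are hypothetical and chosen only for illustration; the actual tagset of the Persian treebank may be structured differently.

```python
def coarsen_tag(tag, sep="_"):
    """Map a fine-grained tag such as 'N_PL' to its coarse class 'N' (assumed tag format)."""
    return tag.split(sep, 1)[0]

def coarsen_tagged_sentence(tagged, sep="_"):
    """Apply coarsen_tag to every (word, tag) pair of a tagged sentence."""
    return [(word, coarsen_tag(tag, sep)) for word, tag in tagged]

# Hypothetical fine-grained tagging of three English tokens.
fine = [("books", "N_PL"), ("book", "N_SING"), ("read", "V_PRS")]
coarse = coarsen_tagged_sentence(fine)

fine_tagset = {tag for _, tag in fine}
coarse_tagset = {tag for _, tag in coarse}
print(coarse)
print(len(fine_tagset), "fine tags ->", len(coarse_tagset), "coarse tags")
```

A smaller tagset leaves fewer distinct preterminals in the grammar, which reduces sparsity but also discards distinctions the parser might have used; the study investigates this trade-off empirically.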

Keywords: probabilistic context-free grammar, parser, treebank, conjunction phrases, parent annotation, child annotation, part-of-speech tags

Type of Study: Research | Subject: Paper
Received: 2018/04/25 | Accepted: 2019/07/10 | Published: 2020/01/7 | ePublished: 2020/01/7

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Signal and Data Processing
