Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Emami, Hojjat

doi:10.52547/jsdp.19.2.133

Volume 19, Issue 2 (9-2022) JSDP 2022, 19(2): 133-146

Hojjat Emami*

University of Bonab

Abstract:

Extracting structured information about entities from web text is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications, including search engines, question-answering systems, recommender systems, and machine translation. An information extraction system aims to identify the entities in a text and extract their related information to form a profile of each target entity.
In recent years, several methods have been proposed for extracting structured information from web text. Most existing methods for extracting entity-centric information require a predefined ontology that contains complete knowledge of the entities and their attributes. The main limitation of these methods is that they cannot extract information about entities that are not already defined in the ontology. In addition, existing methods have ignored semantic information extraction and have not linked the extracted information to entries in a general ontology. Semantic information extraction therefore remains an open problem with room for further work.
In this work, we propose a new method for the automatic extraction of semantically structured information from Farsi web text. The method does not require background knowledge about the entities and their properties. It consists of three main phases: pre-processing, semantic analysis, and frame extraction. To carry out these phases, we use a combination of language resources, text processing tools, and distant ontologies. The method focuses on enriching predicate-argument frames with semantic information extracted from distant ontologies, extracting entity-related information from those frames, and linking the extracted information to its corresponding sense in the DBpedia ontology. This facilitates the processing of Farsi texts by computers.
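The three phases can be pictured with a minimal sketch. It is not the paper's implementation: the `Frame` class, the function names, the `TOY_ONTOLOGY` dictionary, and its `dbpedia:` identifiers are all hypothetical, and the semantic analysis step assumes English-like subject-verb-object input purely for illustration, whereas a real system would rely on Farsi language resources and a semantic role labeler.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a distant ontology: surface form -> identifier.
TOY_ONTOLOGY = {
    "Hafez": "dbpedia:Hafez",
    "Shiraz": "dbpedia:Shiraz",
}


@dataclass
class Frame:
    predicate: str
    arguments: dict                             # role name -> surface text
    links: dict = field(default_factory=dict)   # role name -> ontology id


def preprocess(text):
    """Phase 1 (toy): split on periods, then on whitespace."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s.split() for s in sentences]


def semantic_analysis(tokens):
    """Phase 2 (toy): build one predicate-argument frame per sentence,
    assuming subject-verb-object order, and attach ontology links."""
    if len(tokens) < 3:
        return None
    frame = Frame(
        predicate=tokens[1],
        arguments={"agent": tokens[0], "theme": " ".join(tokens[2:])},
    )
    for role, surface in frame.arguments.items():
        if surface in TOY_ONTOLOGY:
            frame.links[role] = TOY_ONTOLOGY[surface]
    return frame


def extract_profiles(frames):
    """Phase 3 (toy): group frames by agent to form entity profiles."""
    profiles = {}
    for frame in frames:
        entity = frame.arguments["agent"]
        profiles.setdefault(entity, []).append(
            (frame.predicate, frame.arguments["theme"], frame.links.get("theme"))
        )
    return profiles


def run_pipeline(text):
    frames = [semantic_analysis(tokens) for tokens in preprocess(text)]
    return extract_profiles([f for f in frames if f is not None])
```

Calling `run_pipeline("Hafez visited Shiraz. Hafez wrote poems.")` groups both frames under the entity `Hafez`; only the first theme receives an ontology link, because `poems` has no entry in the toy ontology.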
To evaluate the proposed method, we created a small Farsi dataset containing 100 complete sentences and compared the method with three information extraction methods on this dataset. The experimental results show that the proposed method outperforms the counterpart methods in terms of precision and F₁ measure.
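For reference, precision and the F₁ measure can be computed as in the sketch below. It assumes that extracted and gold-standard facts are compared by exact match as sets; the function name and the triple representation are illustrative assumptions, not the matching criterion used in the paper.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted facts against gold-standard facts by exact match."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A system that extracts three facts, two of which appear among four gold facts, has precision 2/3, recall 1/2, and F₁ = 4/7.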

Article number: 9

Keywords: Web mining, information extraction, natural language processing, ontology, structured-semantic information

Type of Study: Research | Subject: Paper
Received: 2019/12/26 | Accepted: 2021/06/20 | Published: 2022/09/30 | ePublished: 2022/09/30


Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.