A Novel Text Mining Method for User Context Extraction to Improve Search Engine Results Ranking

Ahmadi, Ali

doi:10.29252/jsdp.14.3.65

Signal and Data Processing Journal. A scientific journal officially licensed by the Commission for Scientific Publications of the Ministry of Science, Research and Technology (MSRT). Publisher: Research Center for Development of Technologies

Volume 14, Issue 3 (12-2017) JSDP 2017, 14(3): 65-82

Ahmadi A. A Novel Text Mining Method for User Context Extraction to Improve Search Engine Results Ranking. JSDP 2017; 14(3): 65-82
URL: http://jsdp.rcisp.ac.ir/article-1-473-en.html

Abstract:

The importance of text processing and its applications is well known among researchers and students. The volume of textual documents grows day by day, so effective ways to store them and to retrieve information from them are needed. For example, search engines such as Google, Yahoo, and Bing must read a very large number of web documents and retrieve those most similar to the user's query, and they must do so in real time. Keyphrase extraction, along with information extraction, natural language processing, text summarization, query understanding, machine translation, and text similarity, is a subfield of text processing. Much work has been done in text processing, but many problems remain open, especially in semantic document understanding. Although these tasks do not seem very hard for humans, they are complex and confusing for a computer, because documents are not stored in any standard structure from which a computer could extract semantics and content.
Document understanding and keyphrase extraction are among the most important goals of text processing. Many statistical and linguistic approaches have been proposed to address them. Some methods operate on multiple documents and others on a single document; single-document methods are generally more difficult than multi-document ones. Some methods use learning algorithms with training data and others do not. Using natural language processing tools or resources such as ontologies is an effective way to improve results, but these tools are not reliable for all languages. Some published keyphrase extraction methods are based on co-occurrence, and others on statistical measures. In some applications it is also important that a method produce its output in real time. Many approaches that differ along these characteristics have been proposed in the literature.
In this paper, we present a new approach for keyphrase extraction from a single document. The approach is language-independent and combines statistical information extracted from the document with a set of logical rules that we call fundamental text rules. It requires no natural language processing, no ontology, and no document corpus. The method runs in real time and identifies the focuses of each document by extracting its phrases from the segmented document, without using any learning algorithm. The score of each phrase is then calculated from its occurrences and the occurrences of its related phrases. Next, the fundamental text rules omit some phrases based on their scores and their positions in the text. The remaining phrases show the focuses of the document. Evaluation shows that our approach achieves high recall and precision in keyphrase extraction, with very good accuracy in identifying the focuses of a text. The keyphrases extracted from a text represent its most important concepts and can be used to retrieve documents in search engines more efficiently.
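The pipeline outlined in the abstract (segment the document, extract candidate phrases, score each phrase from its own and its related phrases' occurrences, then filter by score and position) can be illustrated with a minimal Python sketch. The abstract does not define the segmentation, the scoring formula, or the fundamental text rules, so everything concrete below is an assumption made for illustration: punctuation-based segmentation, word n-grams as candidates, "related phrases" taken to mean longer candidates that contain the phrase, and the three filter rules with their thresholds. All function and parameter names are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: segments are delimited by punctuation and line breaks.
SEGMENT_BREAK = re.compile(r"[.,;:!?()\n]+")


def extract_candidates(text, max_words=3):
    """Map each word n-gram to the indices of the segments it occurs in."""
    segments = [s.split() for s in SEGMENT_BREAK.split(text.lower()) if s.strip()]
    positions = defaultdict(list)
    for seg_idx, words in enumerate(segments):
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                positions[" ".join(words[i:i + n])].append(seg_idx)
    return positions, len(segments)


def score_phrases(positions):
    """Score = own occurrences minus those of the most frequent longer
    candidate that contains the phrase.

    ASSUMPTION: this is one possible reading of "related phrases"; the
    source does not give the actual formula.
    """
    counts = {p: len(segs) for p, segs in positions.items()}
    scores = {}
    for p, c in counts.items():
        # Padding with spaces makes the containment test word-aligned.
        nested_in = [cq for q, cq in counts.items()
                     if q != p and f" {p} " in f" {q} "]
        scores[p] = c - max(nested_in, default=0)
    return scores, counts


def apply_rules(scores, counts, positions, n_segments,
                min_count=2, early_fraction=0.5):
    """Hypothetical stand-ins for the 'fundamental text rules'."""
    kept = [
        p for p, s in scores.items()
        if counts[p] >= min_count                              # rule 1: phrase is repeated
        and s > 0                                              # rule 2: not absorbed by a longer phrase
        and min(positions[p]) < n_segments * early_fraction    # rule 3: first appears early
    ]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(kept, key=lambda p: (-scores[p], p))


def extract_keyphrases(text):
    positions, n_segments = extract_candidates(text)
    scores, counts = score_phrases(positions)
    return apply_rules(scores, counts, positions, n_segments)


if __name__ == "__main__":
    sample = ("Keyphrase extraction finds topics. "
              "Keyphrase extraction needs no corpus. "
              "Topics guide search.")
    print(extract_keyphrases(sample))
```

Like the approach in the abstract, this sketch uses no training data, corpus, or language-specific resource. It also has no stop list, so on real text frequent function words can survive the filters; the abstract does not say how the actual rules deal with them.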

Keywords: text mining, information retrieval, user context, search engine results ranking

Type of Study: Research | Subject: Paper
Received: 2015/12/25 | Accepted: 2016/07/24 | Published: 2018/01/29 | ePublished: 2018/01/29

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.