Volume 15, Issue 4 (3-2019)                   JSDP 2019, 15(4): 57-70




Shahrood University of Technology
Abstract:

A probabilistic topic model assumes that documents are generated through a process involving topics, and then tries to reverse this process, given the documents, in order to extract those topics. A topic is usually assumed to be a distribution over words. Latent Dirichlet Allocation (LDA) is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution over topics, and each word in the document is sampled from a topic chosen from that distribution. LDA treats a document as a bag of words and ignores word order. Probabilistic topic models such as LDA, which extract topics based on document-level word co-occurrences, are not equipped to benefit from local word relationships. This problem is addressed by combining topics and n-grams in models such as the Bigram Topic Model (BTM). BTM modifies the document generation process slightly by assuming that each topic has several different distributions over words, each of which corresponds to a vocabulary word. Each word in a document is sampled from one of the distributions of its selected topic, and the distribution is determined by the preceding word. BTM therefore relies on exact word order to extract local word relationships and is consequently challenged by sparseness. Another way to solve the problem is to break each document into smaller parts, for example paragraphs, and run LDA on these parts to extract more local word relationships. Again, we are faced with sparseness, and it is well known that LDA does not work well on small documents. In this paper, a new probabilistic topic model is introduced which views a document as a set of overlapping windows, but it does not break the document into those parts and still models the whole document as a single distribution over topics. Each window covers a fixed number of words in the document. In the assumed generation process, we walk through the windows and decide on the topics of their corresponding words. Topics are extracted based on word co-occurrences within the overlapping windows, and the overlap affects the generation process because the topic of a word is considered in all the other windows covering that word. In other words, the proposed model encodes local word relationships without relying on exact word order or breaking the document into smaller parts; nevertheless, it takes word order into account implicitly through the overlapping windows. Topics are still modeled as distributions over words. The proposed model is evaluated on its ability to extract coherent topics and on its document-clustering performance on the 20 Newsgroups dataset. The results show that the proposed model extracts more coherent topics and outperforms LDA and BTM in document clustering.
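To make the generative view described above concrete, the short sketch below builds fixed-size overlapping windows over a toy document and samples each word's topic from a single document-level topic distribution, so that every window covering a word shares that word's topic assignment. This is a minimal illustration written for this summary, not the paper's implementation; the window size, Dirichlet hyperparameters, and names such as generate_document, overlapping_windows, and windows_covering are assumptions.

import numpy as np

WINDOW_SIZE = 3   # words per window; the paper's actual value is not stated here (assumption)
NUM_TOPICS = 4
VOCAB_SIZE = 20

rng = np.random.default_rng(0)

# LDA-style parameters: topic-word distributions (phi) and a single
# document-level topic distribution (theta), both drawn from Dirichlet priors.
phi = rng.dirichlet(np.full(VOCAB_SIZE, 0.1), size=NUM_TOPICS)   # shape (K, V)
theta = rng.dirichlet(np.full(NUM_TOPICS, 0.5))                  # shape (K,)

def generate_document(doc_len):
    """Sample a topic for every word position from theta, then sample the
    word itself from that topic's word distribution."""
    topics = rng.choice(NUM_TOPICS, size=doc_len, p=theta)
    words = np.array([rng.choice(VOCAB_SIZE, p=phi[z]) for z in topics])
    return words, topics

def overlapping_windows(doc_len, window_size=WINDOW_SIZE):
    """All overlapping windows over the document, as lists of word positions;
    consecutive windows share window_size - 1 positions."""
    return [list(range(s, s + window_size))
            for s in range(doc_len - window_size + 1)]

def windows_covering(pos, doc_len, window_size=WINDOW_SIZE):
    """Indices of the windows that cover word position `pos`; the word's single
    topic assignment is shared by all of them, which is how local
    co-occurrence information enters the model without exact word order."""
    first = max(0, pos - window_size + 1)
    last = min(pos, doc_len - window_size)
    return list(range(first, last + 1))

words, topics = generate_document(doc_len=10)
print("document (word ids):", words.tolist())
print("topic assignments:  ", topics.tolist())
print("windows:", overlapping_windows(10))
print("windows covering position 4:", windows_covering(4, 10))

Because consecutive windows share all but one position, a word's topic links the local contexts on both sides of it, which is the intuition behind extracting topics from co-occurrences inside overlapping windows rather than from whole documents or exact bigrams.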
 

Full-Text [PDF 13549 kb]
Type of Study: Research | Subject: Paper
Received: 2017/11/24 | Accepted: 2019/01/26 | Published: 2019/03/08 | ePublished: 2019/03/08

References
[1] H. Faili, H. Ghader, and M. Analoui, "A Bayesian Model for Supervised Grammar Induction," Signal and Data Processing, vol. 9(1), pp. 19-34, 2012.
[2] D. Wang, et al., "Multi-document summarization using sentence-based topic models," Association for Computational Linguistics, 2009.
[3] S. S. Sadegi and B. Vazir Nejad, "Extractive summarization based on cognitive aspects of human mind for narrative text," Signal and Data Processing, vol. 12(2), pp. 87-96, 2015.
[4] H. Zhang and G. Zhong, "Improving short text classification by learning vector representations of both words and hidden topics," Knowledge-Based Systems, vol. 102, pp. 76-86, 2016. [DOI:10.1016/j.knosys.2016.03.027]
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[6] H. M. Wallach, "Topic modeling: beyond bag-of-words," ACM, 2006. [DOI:10.1145/1143844.1143967]
[7] C. D. Manning, et al., "Introduction to Information Retrieval," Cambridge University Press, 2008, 496 pp.
[8] S. Schulte im Walde and A. Melinger, "An in-depth look into the co-occurrence distribution of semantic associates," Italian Journal of Linguistics, Special Issue on From Context to Meaning: Distributional Models of the Lexicon in Linguistics and Cognitive Science, 2008.
[9] N. Barbieri, et al., "Probabilistic topic models for sequence data," Machine Learning, vol. 93(1), pp. 5-29, 2013. [DOI:10.1007/s10994-013-5391-2]
[10] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum, "Topics in semantic representation," Psychological Review, vol. 114(2), p. 211, 2007. [DOI:10.1037/0033-295X.114.2.211] [PMID]
[11] X. Wang, A. McCallum, and X. Wei, "Topical n-grams: Phrase and topic discovery, with an application to information retrieval," IEEE, 2007. [DOI:10.1109/ICDM.2007.86] [PMCID]
[12] G. Yang, et al., "A novel contextual topic model for multi-document summarization," Expert Systems with Applications, vol. 42(3), pp. 1340-1352, 2015. [DOI:10.1016/j.eswa.2014.09.015]
[13] S. Jameel, W. Lam, and L. Bing, "Supervised topic models with word order structure for document classification and retrieval learning," Information Retrieval Journal, vol. 18(4), pp. 283-330, 2015. [DOI:10.1007/s10791-015-9254-2]
[14] Y. W. Teh, "A hierarchical Bayesian language model based on Pitman-Yor processes," Association for Computational Linguistics, 2006.
[15] H. Noji, D. Mochihashi, and Y. Miyao, "Improvements to the Bayesian Topic N-Gram Models," in EMNLP, 2013.
[16] I. Sato and H. Nakagawa, "Topic models with power-law using Pitman-Yor process," ACM, 2010. [DOI:10.1145/1835804.1835890]
[17] Y.-S. Jeong and H.-J. Choi, "Overlapped latent Dirichlet allocation for efficient image segmentation," Soft Computing, vol. 19(4), pp. 829-838, 2015. [DOI:10.1007/s00500-014-1410-x]
[18] Y. Zuo, J. Zhao, and K. Xu, "Word network topic model: a simple but general solution for short and imbalanced texts," Knowledge and Information Systems, pp. 1-20, 2014.
[19] W. Ou, Z. Xie, and Z. Lv, "Spatially Regularized Latent Topic Model for Simultaneous Object Discovery and Segmentation," in 2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2015. [DOI:10.1109/SMC.2015.511]
[20] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101(suppl 1), pp. 5228-5235, 2004. [DOI:10.1073/pnas.0307752101] [PMID] [PMCID]
[21] T. Minka and J. Lafferty, "Expectation-propagation for the generative aspect model," Morgan Kaufmann Publishers Inc., 2002.
[22] J. Rennie, 20 Newsgroups. Available from: http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
[23] G. Heinrich, "Parameter estimation for text analysis," University of Leipzig, Tech. Rep., 2008.
[24] D. Newman, et al., "Automatic evaluation of topic coherence," in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2010.
[25] D. O'Callaghan, et al., "An analysis of the coherence of descriptors in topic modeling," Expert Systems with Applications, vol. 42(13), pp. 5645-5657, 2015. [DOI:10.1016/j.eswa.2015.02.055]
[26] D. Mimno, et al., "Optimizing semantic coherence in topic models," Association for Computational Linguistics, 2011.
[27] M. Meilă, "Comparing clusterings by the variation of information," in Learning Theory and Kernel Machines, Springer, 2003, pp. 173-187. [DOI:10.1007/978-3-540-45167-9_14]

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.