Volume 20, Issue 2 (9-2023) | JSDP 2023, 20(2): 39-58



Heidari V, Taheri S M, Amini M. Topic Modeling Based on Variational Bayes Method. JSDP 2023; 20 (2) : 3
URL: http://jsdp.rcisp.ac.ir/article-1-1228-en.html
Abstract:
The Latent Dirichlet Allocation (LDA) model is a generative model with applications in natural language processing, text mining, dimension reduction, and bioinformatics. It is a powerful technique for topic modeling, a text-mining task that categorizes documents by their topics.
Basic methods for topic modeling, including TF-IDF, the unigram model, and the mixture of unigrams, have been deployed successfully in modern search engines. Although these methods have useful benefits, they do not provide much summarization or reduction. To overcome these shortcomings, latent semantic analysis (LSA) was proposed, which uses the singular value decomposition (SVD) of the word-document matrix to compress a large collection of text corpora. A user's search keywords can then be queried by forming a pseudo-document vector in the same latent space, as sketched below. The next improvement in topic modeling was probabilistic latent semantic analysis (PLSA), which is closely related to LSA and to matrix decomposition with SVD. By introducing exchangeability of the words in a document, topic modeling moved beyond PLSA, leading to the LDA model.
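
As an illustration of the LSA pipeline above, the following sketch builds a TF-IDF word-document matrix with scikit-learn, compresses it with truncated SVD, and folds a query into the latent space as a pseudo-document. The toy corpus, the query string, and the two-component setting are hypothetical, not taken from the paper.

```python
# Minimal LSA sketch: TF-IDF -> truncated SVD, then query via a pseudo-document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the economy grew last quarter",
    "the national team won the match",
    "inflation and interest rates rose",
    "the striker scored two goals",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)            # word-document TF-IDF matrix

svd = TruncatedSVD(n_components=2, random_state=0)
docs_lsa = svd.fit_transform(X)                 # compressed document vectors

# A user's search keywords are folded into the latent space as a pseudo-document.
query = svd.transform(vectorizer.transform(["goals scored in the match"]))
scores = cosine_similarity(query, docs_lsa)[0]
print(scores.argmax())                          # index of the best-matching document
```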
We consider a corpus D containing M documents; each document d has N_d words, and each word is an index into a vocabulary of size V. We define a generative model for each document as follows: first draw the document's topic proportions θ from a Dirichlet distribution Dir(α); then, for each word position n = 1, ..., N_d, draw the word's topic z_n from Multinomial(θ) and draw the word w_n from the topic-word probability matrix β with probability p(w_n | z_n, β). Repeating this procedure generates the whole corpus. We want to find the corpus-level parameters α and β as well as the latent variables θ and z for each document. Unfortunately, the posterior p(θ, z | w, α, β) is intractable, and we have to choose an approximation scheme.
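
A minimal sketch of this generative process, assuming small hypothetical sizes for M, the number of topics K, V, and N_d:

```python
# Sample a toy corpus from the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)
M, K, V, N_d = 5, 3, 50, 20

alpha = np.full(K, 0.1)                          # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.1), size=K)    # K x V topic-word probability matrix

corpus = []
for d in range(M):
    theta = rng.dirichlet(alpha)                 # theta ~ Dir(alpha), per document
    z = rng.choice(K, size=N_d, p=theta)         # z_n ~ Multinomial(theta), per word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Multinomial(beta[z_n])
    corpus.append(w)
```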
In this paper we utilize LDA for collections of discrete text data. We describe procedures for inference and parameter estimation. Since the posterior distribution of the hidden variables given a document is intractable to compute in general, we use an approximate inference algorithm called the variational Bayes method. The basic idea of variational Bayes is to consider a family of tractable approximations to the posterior, each inducing an adjustable lower bound on the log likelihood, and then find the tightest such bound. To estimate the optimal hyper-parameters of the model, we use the empirical Bayes method together with a specialized expectation-maximization (EM) procedure called the variational EM algorithm.
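
For concreteness, here is a sketch of the per-document variational E-step under the standard mean-field factorization for LDA, with variational parameters γ (a Dirichlet over topic proportions) and φ (per-word topic responsibilities). The update φ_ni ∝ β[i, w_n] · exp(Ψ(γ_i)) omits the Ψ(Σ_j γ_j) term, which cancels under normalization. The function name and iteration count are illustrative; α and β are assumed given here (in variational EM they would be re-estimated in the M-step).

```python
# Coordinate-ascent variational E-step for one document:
#   phi[n, i] ∝ beta[i, w_n] * exp(digamma(gamma_i))
#   gamma     = alpha + sum_n phi[n]
import numpy as np
from scipy.special import digamma

def variational_e_step(w, alpha, beta, n_iter=50):
    """w: word indices (N,); alpha: (K,); beta: (K, V) topic-word matrix."""
    K = alpha.shape[0]
    N = w.shape[0]
    phi = np.full((N, K), 1.0 / K)               # per-word topic responsibilities
    gamma = alpha + N / K                        # variational Dirichlet parameters
    for _ in range(n_iter):
        phi = beta[:, w].T * np.exp(digamma(gamma))  # unnormalized update
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```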
Results are reported for document modeling, text classification, and collaborative filtering. The LDA and PLSA topic models are compared on a Persian news data set. We observed that LDA attains perplexity between  and , while PLSA attains perplexity between  and , which shows that LDA dominates PLSA.
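
Perplexity here is the usual held-out measure, exp(−Σ_d log p(w_d) / Σ_d N_d), where lower is better. A minimal computation, with hypothetical per-document log-likelihoods standing in for model outputs:

```python
# Held-out perplexity from per-document log-likelihoods and document lengths.
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))

print(perplexity(np.array([-105.3, -98.7]), np.array([20, 18])))
```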
The LDA model has also been applied for dimension reduction in a document classification problem, together with the support vector machine (SVM) classifier. Two competing models are compared: the first trained on the low-dimensional representation provided by LDA, and the second trained on the full representation of all documents in the corpus, with accuracies  and , respectively. This means some accuracy is lost when the LDA model is used for dimensionality reduction, but it remains within a reasonable range.
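
A sketch of this experimental setup using scikit-learn, whose LatentDirichletAllocation estimator is itself fitted by (online) variational Bayes. The toy documents, labels, and the 20-topic setting are assumptions, not the paper's exact configuration.

```python
# Documents -> bag-of-words counts -> LDA topic proportions -> linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ["sports news about the match", "markets fell on inflation news",
        "the team scored late", "central bank raised rates"]
labels = [0, 1, 0, 1]

clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=20, random_state=0),  # low-dim topic features
    SVC(kernel="linear"),
)
clf.fit(docs, labels)
print(clf.predict(["goals in the second half"]))
```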
Finally, we used the LDA and PLSA methods together with collaborative filtering on the MovieLens 1M data set, and we observed that the predictive perplexity of LDA ranges from  to , while it ranges from  to  for PLSA, again showing the dominance of the LDA method.
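
In the LDA view of collaborative filtering, each user plays the role of a document and the movies they rated play the role of words, so topics become latent taste profiles. A sketch under that interpretation, with hypothetical user histories:

```python
# LDA-based collaborative filtering: users as "documents", movies as "words".
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

n_movies = 6
users = [[0, 1, 2], [0, 2], [3, 4, 5], [4, 5]]   # movie indices per user

# Bag-of-movies count matrix (users x movies).
X = np.zeros((len(users), n_movies), dtype=int)
for u, movies in enumerate(users):
    for m in movies:
        X[u, m] += 1

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                          # per-user taste proportions
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
pred = theta @ beta                               # p(movie | user) for ranking
print(pred[0].argsort()[::-1])                    # movies ranked for user 0
```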
Article number: 3
Full-Text [PDF 1412 kb]
Type of Study: Applied | Subject: Paper
Received: 2021/04/19 | Accepted: 2023/02/22 | Published: 2023/10/22 | ePublished: 2023/10/22

References
[1] M. S. Rasoli, B. Minaei Bidgoli, H. Faili, and M. Aminian, "Unsupervised Persian Verb Valency Induction," Signal and Data Processing, vol. 9, no. 2, pp. 3-12, 2013.
[2] E. Asgarian, M. Kahani, and S. Sharifi, "HesNegar: Persian Sentiment WordNet," Signal and Data Processing, vol. 15, no. 1, pp. 71-86, 2018. [DOI:10.29252/jsdp.15.1.71]
[3] H. Faili, "Phrasal Verb Translation from English to Persian Using Statistical Parsing," Signal and Data Processing, vol. 7, no. 1, pp. 66-76, 2010.
[4] H. Faili, H. Ghader, and M. Analoui, "A Bayesian Model for Supervised Grammar Induction," Signal and Data Processing, vol. 9, no. 1, pp. 19-34, 2012.
[5] B. Masoudi and R. G. Saeid, "Farsi Word Sense Disambiguation with LDA Topic Model," Signal and Data Processing, vol. 12, no. 4, pp. 117-125, 2016.
[6] E. Asgari and J.-C. Chappelier, "Linguistic Resources and Topic Models for the Analysis of Persian Poems," Proceedings of the Second Workshop on Computational Linguistics for Literature, Atlanta, Georgia, pp. 23-31, 2013.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[8] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. Harshman, "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990. [DOI:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9]
[9] Y. Du, Y. Yi, X. Li, X. Chen, Y. Fan, and F. Su, "Extracting and Tracking Hot Topics of Micro-blogs Based on Improved Latent Dirichlet Allocation," Engineering Applications of Artificial Intelligence, vol. 87, p. 103279, 2020. [DOI:10.1016/j.engappai.2019.103279]
[10] C. Geigle, "Inference Methods for Latent Dirichlet Allocation," Course notes (CS598CXZ Advanced Topics in Information Retrieval), Department of Computer Science, University of Illinois at Urbana-Champaign, 2016.
[11] Y. Gong, Q. Zhang, and X. Huang, "Hashtag Recommendation for Multimodal Microblog Posts," Neurocomputing, vol. 272, pp. 170-177, 2018. [DOI:10.1016/j.neucom.2017.06.056]
[12] M. Hoffman, D. Blei, and F. Bach, "Online Learning for Latent Dirichlet Allocation," Advances in Neural Information Processing Systems, pp. 856-864, 2010.
[13] T. Hofmann, "Probabilistic Latent Semantic Indexing," SIGIR '99, pp. 50-57, 1999. [DOI:10.1145/312624.312649]
[14] T. Hofmann, "Probabilistic Latent Semantic Analysis," UAI '99, pp. 289-296, 1999.
[15] T. Hofmann, "Unsupervised Learning by Probabilistic Latent Semantic Analysis," Machine Learning, vol. 42, pp. 177-196, 2001. [DOI:10.1023/A:1007617005950]
[16] H. Jelodar, Y. Wang, C. Yuan, X. Feng, X. Jiang, Y. Li, and L. Zhao, "Latent Dirichlet Allocation (LDA) and Topic Modeling: Models, Applications, a Survey," Multimedia Tools and Applications, vol. 78, pp. 15169-15211, 2019. [DOI:10.1007/s11042-018-6894-4]
[17] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, USA: Prentice Hall PTR, 2000.
[18] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets, USA: Cambridge University Press, 2014. [DOI:10.1017/CBO9781139924801]
[19] B. Liu, C. Wang, Y. Wang, K. Zhang, and C. Wang, "Microblog Topic Mining Based on FR-DATM," Chinese Journal of Electronics, vol. 27, pp. 334-341, 2018. [DOI:10.1049/cje.2017.12.006]
[20] X. Liu, Y. Gao, Z. Cao, and G. Sun, "LDA-based Topic Mining of Microblog Comments," Journal of Physics: Conference Series, vol. 1757, p. 012118, 2021. [DOI:10.1088/1742-6596/1757/1/012118]
[21] Y. Lu, Q. Mei, and C. Zhai, "Investigating Task Performance of Probabilistic Topic Models: An Empirical Study of PLSA and LDA," Information Retrieval, vol. 14, pp. 178-203, 2011. [DOI:10.1007/s10791-010-9141-9]
[22] F. M. Harper and J. A. Konstan, "The MovieLens Datasets: History and Context," ACM Transactions on Interactive Intelligent Systems, vol. 5, 2015. [DOI:10.1145/2827872]
[23] T. Minka, "Estimating a Dirichlet Distribution," Technical report, M.I.T., 2000.
[24] K. P. Murphy, Machine Learning: A Probabilistic Perspective, London, England: MIT Press, 2012.
[25] A. Raj, M. Stephens, and J. K. Pritchard, "fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets," Genetics, vol. 197, pp. 573-589, 2014. [DOI:10.1534/genetics.114.164350]
[26] V. Smidl and A. Quinn, The Variational Bayes Method in Signal Processing, Berlin Heidelberg, Germany: Springer, 2006.

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
