<?xml version="1.0" encoding="utf-8"?>
<journal>
<title>Signal and Data Processing</title>
<title_fa>پردازش علائم و داده‌ها</title_fa>
<short_title>JSDP</short_title>
<subject>Engineering &amp; Technology</subject>
<web_url>http://jsdp.rcisp.ac.ir</web_url>
<journal_hbi_system_id>1</journal_hbi_system_id>
<journal_hbi_system_user>admin</journal_hbi_system_user>
<journal_id_issn>2538-4201</journal_id_issn>
<journal_id_issn_online>2538-421X</journal_id_issn_online>
<journal_id_pii></journal_id_pii>
<journal_id_doi>10.61882/jsdp</journal_id_doi>
<journal_id_iranmedex></journal_id_iranmedex>
<journal_id_magiran></journal_id_magiran>
<journal_id_sid>1</journal_id_sid>
<journal_id_nlai>8888</journal_id_nlai>
<journal_id_science></journal_id_science>
<language>fa</language>
<pubdate>
	<type>jalali</type>
	<year>1402</year>
	<month>6</month>
	<day>1</day>
</pubdate>
<pubdate>
	<type>gregorian</type>
	<year>2023</year>
	<month>9</month>
	<day>1</day>
</pubdate>
<volume>20</volume>
<number>2</number>
<publish_type>online</publish_type>
<publish_edition>1</publish_edition>
<article_type>fulltext</article_type>
<articleset>
	<article>


	<language>fa</language>
	<article_id_doi></article_id_doi>
	<title_fa>الگوسازی موضوع‌ها بر پایه‌ی روش بیز گوناگونی</title_fa>
	<title>Topic Modeling Based on Variational Bayes Method</title>
	<subject_fa>مقالات پردازش متن</subject_fa>
	<subject>Paper</subject>
	<content_type_fa>كاربردي</content_type_fa>
	<content_type>Applicable</content_type>
	<abstract_fa>در این مقاله، برپایه&#8204;ی روش بیز گوناگونی، نشان می&#8204;دهیم که روش تخصیص پنهان دیریکله که یک مدل احتمالاتی مولّد است و در پردازش زبان&#8204;های طبیعی، متن&#8204;کاوی، کاهش ابعاد، و زیست&#8204;داده&#8204;ورزی کاربرد دارد، نسبت به روش تحلیل معنایی پنهان احتمالاتی در مدل&#8204;بندی داده&#8204;ها عملکرد بهتری دارد. در این باره، ابتدا یک مدل بیزی را در مدل&#8204;سازی موضوع&#8204;ها شرح می&#8204;دهیم. آنگاه با روش بیز گوناگونی و الگوریتم امیدریاضی-بیشینه&#8204;سازی (EM) پارامترهای مدل را برآورد می&#8204;کنیم. سپس الگوریتم ارائه&#8204;شده، موسوم به الگوریتم EM گوناگونی، را برپایه&#8204;ی یک مجموعه&#8204;داده&#8204;ی نوشتاری از داده&#8204;های واقعی در زمینه&#8204;ی تحلیل داده&#8204;های خبری پیاده&#8204;سازی می&#8204;کنیم و مدل&#8204;بندی زبانی را بر اساس ملاک سرگشتگی بررسی می&#8204;کنیم، و دقت خوشه&#8204;بندی موضوع&#8204;ها و کاربرد کاهش ابعاد داده&#8204;های حجیم را با کمک ماشین بردار پشتیبان می&#8204;سنجیم. همچنین در مقایسه&#8204;ای دیگر، کاربرد الگوریتم پیشنهادی را در پالایش همکارانه بررسی می&#8204;کنیم.</abstract_fa>
	<abstract>The Latent Dirichlet Allocation (LDA) model is a generative model with several applications in natural language processing, text mining, dimension reduction, and bioinformatics. It is a powerful technique for topic modeling in text mining, which is a data mining method to categorize documents by their topics.&lt;br&gt;
Basic methods for topic modeling, including TF-IDF, the unigram model, and mixtures of unigrams, have been successfully deployed in modern search engines. Although these methods have some useful benefits, they do not provide much summarization or reduction. To overcome these shortcomings, latent semantic analysis (LSA) has been proposed, which uses the singular value decomposition (SVD) of the word-document matrix to compress large collections of text. A user&amp;rsquo;s search keywords can then be queried by forming a pseudo-document vector. The next improvement in topic modeling was probabilistic latent semantic analysis (PLSA), which is closely related to LSA and to matrix decomposition by SVD. By introducing exchangeability of the words in documents, topic modeling proceeded beyond PLSA and led to the LDA model.&lt;br&gt;
We consider a corpus D containing M documents, where each document d has N_d words and each word is an indicator of one of V vocabulary terms. We define a generative model for each document as follows: draw the document&amp;rsquo;s topic proportions &#952; from the Dirichlet distribution Dir(&#945;); then, for each word position n, draw the topic z_n of that word from Multinomial(&#952;), and draw the word itself from the topic-word probability matrix &#946;, conditioned on z_n. Repeating this procedure generates the whole corpus. We want to estimate the corpus-level parameters &#945; and &#946;, as well as the latent variables &#952; and z for each document. Unfortunately, the posterior distribution of these latent variables given a document is intractable, and we have to choose an approximation scheme.&lt;br&gt;
In this paper, we utilize LDA for collections of discrete text corpora and describe procedures for inference and parameter estimation. Since computing the posterior distribution of the hidden variables given a document is intractable in general, we use an approximate inference algorithm called the variational Bayes method. The basic idea of variational Bayes is to consider a family of adjustable lower bounds on the posterior and then find the tightest one. To estimate the optimal hyper-parameters of the model, we use the empirical Bayes method, together with a specialized expectation-maximization (EM) algorithm called the variational EM algorithm.&lt;br&gt;
The results are reported for document modeling, text classification, and collaborative filtering. The LDA and PLSA topic models are compared on a Persian news data set, where LDA attains a lower perplexity range than PLSA, which shows the domination of LDA over PLSA.&lt;br&gt;
The LDA model has also been applied to dimension reduction in a document classification problem, together with the support vector machine (SVM) classification method. Two competing models are compared: the first trained on the low-dimensional representation provided by LDA, and the second trained on the full documents of the corpus. With the LDA-based representation some accuracy is lost, but it remains within a reasonable range, so the LDA model is useful for dimensionality reduction.&lt;br&gt;
Finally, we use the LDA and PLSA methods with collaborative filtering on the MovieLens 1M data set, and observe that LDA achieves a lower predictive perplexity than PLSA, again showing the domination of the LDA method.</abstract>
	<keyword_fa>روش بیز گوناگونی, تخصیص پنهان دیریکله, الگوریتم امیدریاضی-بیشینه‌سازی, یادگیری ماشین, پردازش زبان‌های طبیعی</keyword_fa>
	<keyword>Variational Bayes method, Latent Dirichlet allocation, Expectation-Maximization algorithm, Machine learning, Natural language processing</keyword>
	<start_page>39</start_page>
	<end_page>58</end_page>
	<web_url>http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-2200-1&amp;slc_lang=fa&amp;sid=1</web_url>


<author_list>
	<author>
	<first_name>Vahid</first_name>
	<middle_name></middle_name>
	<last_name>Heidari</last_name>
	<suffix></suffix>
	<first_name_fa>وحید</first_name_fa>
	<middle_name_fa></middle_name_fa>
	<last_name_fa>حیدری</last_name_fa>
	<suffix_fa></suffix_fa>
	<email>vahid.heidari@ut.ac.ir</email>
	<code>100319475328460012427</code>
	<orcid>100319475328460012427</orcid>
	<coreauthor>No</coreauthor>
	<affiliation></affiliation>
	<affiliation_fa>دانشگاه تهران</affiliation_fa>
	 </author>


	<author>
	<first_name>S. Mahmoud</first_name>
	<middle_name></middle_name>
	<last_name>Taheri</last_name>
	<suffix></suffix>
	<first_name_fa>سید محمود</first_name_fa>
	<middle_name_fa></middle_name_fa>
	<last_name_fa>طاهری</last_name_fa>
	<suffix_fa></suffix_fa>
	<email>sm_taheri@ut.ac.ir</email>
	<code>100319475328460012428</code>
	<orcid>100319475328460012428</orcid>
	<coreauthor>Yes</coreauthor>
	<affiliation></affiliation>
	<affiliation_fa>دانشگاه تهران</affiliation_fa>
	 </author>


	<author>
	<first_name>Morteza</first_name>
	<middle_name></middle_name>
	<last_name>Amini</last_name>
	<suffix></suffix>
	<first_name_fa>مرتضی</first_name_fa>
	<middle_name_fa></middle_name_fa>
	<last_name_fa>امینی</last_name_fa>
	<suffix_fa></suffix_fa>
	<email>morteza.amini@ut.ac.ir</email>
	<code>100319475328460012429</code>
	<orcid>100319475328460012429</orcid>
	<coreauthor>No</coreauthor>
	<affiliation></affiliation>
	<affiliation_fa>دانشگاه تهران</affiliation_fa>
	 </author>


</author_list>


	</article>
</articleset>
</journal>
