Volume 15, Issue 4 (3-2019)                   JSDP 2019, 15(4): 57-70 | Back to browse issues page

XML Persian Abstract Print

Download citation:
BibTeX | RIS | EndNote | Medlars | ProCite | Reference Manager | RefWorks
Send citation to:

Rahimi M, Zahedi M, Mashayekhi H. A Probabilistic Topic Model based on Local Word Relationships in Overlapped Windows. JSDP. 2019; 15 (4) :57-70
URL: http://jsdp.rcisp.ac.ir/article-1-673-en.html
Shahrood University of Technology
Abstract:   (91 Views)

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents and extract topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution over topics and each word in the document is sampled from a chosen topic of that distribution. It assumes that a document is a bag of words and ignores the order of the words. Probabilistic topic models such as LDA which extract the topics based on documents-level word co-occurrences are not equipped to benefit from local word relationships. This problem is addressed by combining topics and n-grams, in models like Bigram Topic Model (BTM). BTM modifies the document generation process slightly by assuming that there are several different distributions of words for each topic, each of which correspond to a vocabulary word. Each word in a document is sampled from one of the distributions of its selected topic. The distribution is determined by its previous word. So BTM relies on exact word orders to extract local word relationships and thus is challenged by sparseness. Another way to solve the problem is to break each document into smaller parts for example paragraphs and use LDA on these parts to extract more local word relationships in these small parts. Again, we will be faced with sparseness and it is well-known that LDA does not work well on small documents. In this paper, a new probabilistic topic model is introduced which assumes a document is a set of overlapping windows but does not break the document into those parts and assumes the whole document as a single distribution over topics. Each window corresponds to a fixed number of words in the document. In the assumed generation process, we walk through windows and decide on the topic of their corresponding words. Topics are extracted based on words co-occurrences in the overlapping windows and the overlapping windows affect the process of document generation because; the topic of a word is considered in all the other windows overlapping on the word. On the other words, the proposed model encodes local word relationships without relying on exact word order or breaking the document into smaller parts. The model, however, takes the word order into account implicitly by assuming the windows are overlapped. The topics are still considered as distributions over words. The proposed model is evaluated based on its ability to extract coherent topics and its clustering performance on the 20 newsgroups dataset. The results show that the proposed model extracts more coherent topics and outperforms LDA and BTM in the application of document clustering.

Full-Text [PDF 13549 kb]   (59 Downloads)    
Type of Study: Research | Subject: Paper
Received: 2017/11/24 | Accepted: 2019/01/26 | Published: 2019/03/8 | ePublished: 2019/03/8

Add your comments about this article : Your username or Email:

Send email to the article author

© 2015 All Rights Reserved | Signal and Data Processing