An Information-Theoretic Approach for Unsupervised Topic Mining in Large Text Collections

Authors:
Eduardo H. Ramirez;Ramon F. Brena
Venue:
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2009

Abstract

In this paper we focus on the task of identifying topics in large text collections in a completely unsupervised way. In contrast to probabilistic topic modeling methods, which require first estimating the density of probability distributions, we model topics as subsets of terms that are used as queries to an index of documents. By retrieving the documents relevant to those topical queries, we obtain overlapping clusters of semantically similar documents. To find the topical queries, we generate candidate queries using signature-calculation heuristics such as those used in duplicate-detection methods, and then evaluate the candidates using an information-gain function which we define as "semantic force". The method targets the semantic analysis of collections on the order of millions of documents, so it has been implemented in map-reduce style. We present some initial results that support the feasibility of the approach.
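
As a rough illustration of the pipeline outlined in the abstract (index the documents, generate candidate queries, retrieve the matching documents, score each candidate), the following minimal single-machine sketch may help. It is not the authors' implementation. The abstract does not give the signature heuristic or the formula for "semantic force", so both are replaced here by labeled stand-ins: a hypothetical "rarest shared terms" heuristic and a pointwise-mutual-information score. All function names are illustrative assumptions, and the map-reduce aspect is not covered.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_index(docs):
    """Inverted index: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def candidate_queries(docs, index, k=3, size=2):
    """Stand-in signature heuristic (an assumption, not the paper's):
    for each document keep its k rarest terms that occur in at least two
    documents, then emit every subset of `size` terms as a candidate query."""
    candidates = set()
    for text in docs:
        shared = [t for t in set(text.lower().split()) if len(index[t]) >= 2]
        rarest = sorted(shared, key=lambda t: (len(index[t]), t))[:k]
        candidates.update(frozenset(c) for c in combinations(rarest, size))
    return candidates


def retrieve(query, index):
    """Conjunctive retrieval: ids of documents that contain every query term."""
    postings = [index[t] for t in query]
    return set.intersection(*postings) if postings else set()


def score(query, index, n_docs):
    """Placeholder score (NOT the paper's "semantic force"): pointwise mutual
    information between the query terms, estimated from document frequencies."""
    joint = len(retrieve(query, index)) / n_docs
    if joint == 0:
        return float("-inf")
    independent = math.prod(len(index[t]) / n_docs for t in query)
    return math.log(joint / independent)


def mine_topics(docs):
    """Return (query, cluster, score) triples, best-scoring queries first."""
    index = build_index(docs)
    scored = [
        (sorted(q), retrieve(q, index), score(q, index, len(docs)))
        for q in candidate_queries(docs, index)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    toy_docs = [
        "solar panel energy output",
        "solar panel installation cost",
        "guitar chord progression",
        "guitar chord tuning tips",
    ]
    for query, cluster, value in mine_topics(toy_docs):
        print(query, sorted(cluster), round(value, 3))
```

Because each candidate query retrieves its own document set, a document can fall into several clusters, which matches the overlapping clusters mentioned in the abstract. The toy collection is too small to show this, but the structure permits it.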