In this article, we investigate the use of a probabilistic model for unsupervised clustering of text collections. Unsupervised clustering has become a basic building block for many intelligent text processing applications, such as information retrieval, text classification, and information extraction. Probabilistic clustering models, which build ''soft'' theme-document associations, have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between that document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional ''semantic'' space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The model considered in this paper is a mixture of multinomial distributions over word counts, with each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting from the expectation-maximization (EM) algorithm as the basic inference tool, we discuss the importance of initialization and the influence of other design choices, such as the smoothing strategy and the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We show empirically that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, owing to the specific profile of word count distributions. Finally, exploiting the fact that the model parameters can be integrated out analytically, we show that Gibbs sampling on the theme configurations is tractable and compares favorably with the basic EM approach.
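To make the setting concrete, the core model described above — a mixture of multinomial distributions over word counts, fitted with EM — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random Dirichlet initialization, the Laplace smoothing constant `alpha`, and all function and variable names are assumptions made here for the example.

```python
import numpy as np

def em_multinomial_mixture(X, n_themes, n_iter=50, alpha=1e-2, seed=0):
    """EM for a mixture of multinomials over word counts.

    X : (n_docs, n_words) integer count matrix.
    alpha : small Laplace smoothing constant (a choice made for this
            sketch, not prescribed by the paper's abstract).
    Returns (pi, beta, resp): mixture weights, per-theme word
    distributions, and soft theme-document association vectors.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    pi = np.full(n_themes, 1.0 / n_themes)
    # Random initialization; the abstract stresses that initialization matters.
    beta = rng.dirichlet(np.ones(n_words), size=n_themes)
    for _ in range(n_iter):
        # E-step: log p(theme k | doc d), up to an additive constant.
        log_r = np.log(pi) + X @ np.log(beta).T      # (n_docs, n_themes)
        log_r -= log_r.max(axis=1, keepdims=True)    # numerical stability
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)      # soft assignments
        # M-step: re-estimate weights and smoothed word distributions.
        pi = resp.mean(axis=0)
        counts = resp.T @ X + alpha                  # (n_themes, n_words)
        beta = counts / counts.sum(axis=1, keepdims=True)
    return pi, beta, resp

# Usage: each row of `resp` is the probability vector associating a
# document with the themes, usable as a low-dimensional projection.
X = np.array([[5, 0, 1], [4, 1, 0], [0, 5, 1], [1, 4, 0]])
pi, beta, resp = em_multinomial_mixture(X, n_themes=2)
```

The rows of `resp` sum to one and play the role of the ''soft'' theme-document association vectors discussed above; the paper's further refinements (incremental vocabulary introduction, collapsed Gibbs sampling over theme configurations) build on this same model.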