Model-based document clustering with a collapsed Gibbs sampler

Authors:
Daniel David Walker;Eric K. Ringger
Affiliations:
Brigham Young University, Provo, UT, USA;Brigham Young University, Provo, UT, USA
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

References (6)

An experimental comparison of model-based clustering methods
Machine Learning

A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning

Iterative Clustering of High Dimensional Text Data Augmented by Local Search
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining

Latent Dirichlet allocation
The Journal of Machine Learning Research

Latent Dirichlet Co-Clustering
ICDM '06 Proceedings of the Sixth International Conference on Data Mining

Comparing clusterings---an information based distance
Journal of Multivariate Analysis

Cited by (4)

The NVI clustering evaluation measure
CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning

iLoc: a framework for incremental location-state acquisition and prediction based on mobile sensors
Proceedings of the 18th ACM conference on Information and knowledge management

Evaluating models of latent document semantics in the presence of OCR errors
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Representing document as dependency graph for document clustering
Proceedings of the 20th ACM international conference on Information and knowledge management

Abstract

Model-based algorithms are emerging as a preferred method for document clustering. As computing resources improve, methods such as Gibbs sampling have become more common for parameter estimation in these models. Gibbs sampling is well understood for many applications, but has not been extensively studied for use in document clustering. We explore the convergence rate, the possibility of label switching, and chain summarization methodologies for document clustering on a particular model, namely a mixture of multinomials model, and show that fairly simple methods can be employed, while still producing clusterings of superior quality compared to those produced with the EM algorithm.
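
To make the setting concrete, the following is a minimal sketch of a collapsed Gibbs sampler for a mixture of multinomials, written for illustration only and not taken from the paper. It assumes symmetric Dirichlet priors on the mixing proportions (alpha) and on each cluster's word distribution (beta), both integrated out, so that only the cluster label of each document is sampled. The function name, parameter names, default values, and the bag-of-words input format are all hypothetical choices.

```python
import math
import random


def collapsed_gibbs(docs, vocab_size, num_clusters, alpha=1.0, beta=0.1,
                    sweeps=50, seed=0):
    """Sample cluster labels for docs (list of dicts: word id -> count).

    Illustrative sketch only; alpha and beta are assumed symmetric
    Dirichlet hyperparameters, not values from the paper.
    """
    rng = random.Random(seed)
    z = [rng.randrange(num_clusters) for _ in docs]    # random initial labels
    doc_count = [0] * num_clusters                     # documents per cluster
    word_count = [[0] * vocab_size for _ in range(num_clusters)]
    total = [0] * num_clusters                         # tokens per cluster

    def update(d, k, sign):
        # Add (sign=+1) or remove (sign=-1) document d from cluster k.
        doc_count[k] += sign
        for v, c in docs[d].items():
            word_count[k][v] += sign * c
            total[k] += sign * c

    for d, k in enumerate(z):
        update(d, k, +1)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            update(d, z[d], -1)                        # hold out document d
            n_d = sum(doc.values())
            log_p = []
            for k in range(num_clusters):
                # Prior term: cluster size plus pseudo-count.
                lp = math.log(doc_count[k] + alpha)
                # Dirichlet-multinomial predictive term for the whole document.
                lp += math.lgamma(total[k] + vocab_size * beta)
                lp -= math.lgamma(total[k] + n_d + vocab_size * beta)
                for v, c in doc.items():
                    lp += math.lgamma(word_count[k][v] + c + beta)
                    lp -= math.lgamma(word_count[k][v] + beta)
                log_p.append(lp)
            m = max(log_p)                             # stabilize before exp
            weights = [math.exp(lp - m) for lp in log_p]
            z[d] = rng.choices(range(num_clusters), weights=weights)[0]
            update(d, z[d], +1)
    return z
```

The function returns only the labels from the final sweep, which is the simplest possible way to summarize the chain. Cluster indices are arbitrary, so two runs can describe the same partition with permuted labels; this is the label-switching issue the abstract mentions.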