Document clustering using dirichlet process mixture model of von Mises-Fisher distributions

Authors:
Nguyen Kim Anh;Nguyen The Tam;Ngo Van Linh
Affiliations:
Hanoi University of Science and Technology;Hanoi University of Science and Technology;Hanoi University of Science and Technology
Venue:
Proceedings of the Fourth Symposium on Information and Communication Technology
Year:
2013

Citing 7
Cited 0

A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs

SIAM Journal on Scientific Computing
Relationship-based clustering and cluster ensembles for high-dimensional data mining

Relationship-based clustering and cluster ensembles for high-dimensional data mining
Generative model-based document clustering: a comparative study

Knowledge and Information Systems
Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

The Journal of Machine Learning Research
Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution

ICML '06 Proceedings of the 23rd international conference on Machine learning
Introduction to Information Retrieval

Introduction to Information Retrieval
Document clustering via dirichlet process mixture model with feature selection

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We have developed a mean-field variational inference algorithm for the DPM model of vMFs that is applied to clustering text documents. Using this model, the number of clusters is determined automatically after the clustering process rather than pre-estimated. We conducted extensive experiments to evaluate the proposed approach on a large number of high dimensional text datasets. Empirical experimental results over NMI (Normalized Mutual Information) and Purity evaluation measures demonstrate that our approach outperforms the four state-of-the-art clustering algorithms.