Document clustering using Dirichlet process mixture model of von Mises-Fisher distributions

  • Authors: Nguyen Kim Anh, Nguyen The Tam, Ngo Van Linh
  • Affiliation: Hanoi University of Science and Technology (all authors)
  • Venue: Proceedings of the Fourth Symposium on Information and Communication Technology
  • Year: 2013

Abstract

Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model for clustering directional data, based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We develop a mean-field variational inference algorithm for the DPM model of vMFs and apply it to clustering text documents. With this model, the number of clusters is determined automatically by the clustering process rather than specified in advance. We conducted extensive experiments to evaluate the proposed approach on a large number of high-dimensional text datasets. Experimental results on the NMI (Normalized Mutual Information) and Purity evaluation measures show that our approach outperforms four state-of-the-art clustering algorithms.
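
For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of how Purity and NMI can be computed from ground-truth class labels and predicted cluster labels. It is not taken from the paper: the function names are hypothetical, and the NMI normalization used here (the geometric mean of the two label entropies) is an assumption, since the abstract does not say which variant the authors used.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority true class of their cluster."""
    by_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information normalized by sqrt(H(true) * H(pred)).

    Assumption: geometric-mean normalization; other variants divide by the
    arithmetic mean or the maximum of the two entropies.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = math.sqrt(entropy(labels_true) * entropy(labels_pred))
    # If either labeling has a single group, its entropy is zero; return 0.0.
    return mi / denom if denom > 0 else 0.0
```

Both measures are invariant to how the clusters are numbered, which matters for a model such as a DPM where the number of clusters and their labels are not fixed beforehand. Purity tends to increase as the number of clusters grows, so it is usually reported together with NMI.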