Generative model-based document clustering: a comparative study

  • Authors: Shi Zhong; Joydeep Ghosh

  • Affiliations: Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA; Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA

  • Venue: Knowledge and Information Systems
  • Year: 2005

Abstract

This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.
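The four document-to-cluster assignment strategies named in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes each document already has one log-likelihood per cluster (with any cluster prior folded in), and the function names, the example scores, and the choice to divide the whole score by the temperature are illustrative assumptions. Under those assumptions, soft assignment is a softmax over the log-likelihoods, hard assignment takes the argmax, stochastic assignment samples from the soft weights, and DA assignment is the same softmax at a temperature that is lowered over the course of training.

```python
import math
import random


def soft_assign(log_liks, temperature=1.0):
    """Softmax of per-cluster log-likelihoods scaled by 1/temperature.

    With temperature=1.0 this gives the usual EM-style posterior weights
    (assuming the prior is already folded into log_liks).
    """
    scaled = [ll / temperature for ll in log_liks]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_assign(log_liks):
    """Index of the single most likely cluster (k-means-style assignment)."""
    return max(range(len(log_liks)), key=lambda k: log_liks[k])


def stochastic_assign(log_liks, rng):
    """Sample one cluster index in proportion to the soft weights."""
    weights = soft_assign(log_liks)
    return rng.choices(range(len(log_liks)), weights=weights, k=1)[0]


def da_assign(log_liks, temperature):
    """Deterministic-annealing weights: soft assignment at a given temperature.

    High temperature gives near-uniform weights; as the temperature
    approaches 0 the weights concentrate on the hard assignment.
    """
    return soft_assign(log_liks, temperature)


# Hypothetical log-likelihoods of one document under three clusters.
log_liks = [-10.0, -11.0, -14.0]
rng = random.Random(0)  # seeded only to make the sketch repeatable

print("hard:", hard_assign(log_liks))
print("soft:", soft_assign(log_liks))
print("stochastic:", stochastic_assign(log_liks, rng))
for t in (100.0, 1.0, 0.01):
    print("DA at T =", t, da_assign(log_liks, t))
```

The sketch also hints at the cost trade-off the abstract mentions: hard assignment needs only an argmax per document, whereas soft and DA assignments carry a full weight vector per document, and DA repeats the fit at several temperatures.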