Iterative Clustering of High Dimensional Text Data Augmented by Local Search

  • Authors: Inderjit S. Dhillon; Yuqiang Guan; J. Kogan

  • Venue: ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
  • Year: 2002

Abstract

The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially when cluster sizes are small, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal solution. In this paper, we present a local search procedure, which we call "first-variation", that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. An enhancement of first variation allows a chain of such moves in a Kernighan-Lin fashion and leads to a better local maximum. Combining the enhanced first-variation with spherical k-means yields a powerful "ping-pong" strategy that often qualitatively improves k-means clustering and is computationally efficient. We present several experimental results to highlight the improvement achieved by our proposed algorithm in clustering high-dimensional and sparse text data.
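
The alternation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: documents are dense unit-length NumPy rows (the paper targets sparse data), the objective is the total cosine similarity between each document and its normalized cluster centroid, and the local search makes only the single best one-point move per round rather than the Kernighan-Lin chain of moves. All function names (normalize_rows, objective, spherical_kmeans, first_variation_step, ping_pong) are hypothetical.

```python
import numpy as np

def normalize_rows(X):
    # Scale each row to unit length; all-zero rows are left as zeros.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def objective(X, labels, k):
    # For unit-length rows, the total cosine similarity to the normalized
    # cluster centroids equals the sum of the norms of the cluster sums.
    return sum(np.linalg.norm(X[labels == j].sum(axis=0)) for j in range(k))

def spherical_kmeans(X, labels, k, max_iter=100):
    # Standard batch iterations: recompute centroids, reassign by cosine.
    labels = labels.copy()
    for _ in range(max_iter):
        sums = np.vstack([X[labels == j].sum(axis=0) for j in range(k)])
        centroids = normalize_rows(sums)
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def first_variation_step(X, labels, k):
    # Find the single point move that increases the objective the most.
    # Simplification: one move only, no chain of moves.
    sums = np.vstack([X[labels == j].sum(axis=0) for j in range(k)])
    norms = np.linalg.norm(sums, axis=1)
    best_gain, best_move = 0.0, None
    for i, x in enumerate(X):
        src = labels[i]
        # Change in the source cluster's term when x leaves it.
        loss = np.linalg.norm(sums[src] - x) - norms[src]
        # Net objective change for moving x into each candidate cluster.
        gains = np.linalg.norm(sums + x, axis=1) - norms + loss
        gains[src] = -np.inf  # staying put is not a move
        dst = int(np.argmax(gains))
        if gains[dst] > best_gain:
            best_gain, best_move = gains[dst], (i, dst)
    return best_gain, best_move

def ping_pong(X, labels, k, tol=1e-9, max_rounds=50):
    # Alternate batch k-means with one incremental move until no move helps.
    X = normalize_rows(np.asarray(X, dtype=float))
    labels = np.asarray(labels).copy()
    for _ in range(max_rounds):
        labels = spherical_kmeans(X, labels, k)
        gain, move = first_variation_step(X, labels, k)
        if move is None or gain <= tol:
            break
        labels[move[0]] = move[1]
    return labels
```

Calling ping_pong(X, initial_labels, k) with a document-by-term matrix X and an integer label array returns refined labels. The incremental step evaluates the exact objective change of a single move, so it can still find an improvement after the batch iterations have stopped changing any assignment.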