K-means clustering with feature hashing

  • Authors: Hajime Senuma
  • Affiliations: University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan
  • Venue: HLT-SS '11 Proceedings of the ACL 2011 Student Session
  • Year: 2011

Abstract

One of the major problems of K-means is that its centroids must be stored as dense vectors, so when the feature space is high-dimensional it becomes infeasible to hold them in memory. We address this issue with feature hashing (Weinberger et al., 2009), a dimension-reduction technique that shrinks dense vectors while preserving the sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to K-means by showing how much the K-means objective is (additively) distorted. Furthermore, to verify our method empirically, we ran experiments on a document clustering task.
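
The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`hash_features`, `squared_distance`, `kmeans_hashed`), the MD5-based hash, the signed-hash variant, the first-k initialization, and the fixed iteration count are all assumptions made for illustration. Each sparse document vector is hashed into `m` dimensions, so every dense centroid needs only `m` floats regardless of the original vocabulary size, while each document keeps at most as many nonzeros as it had before hashing.

```python
import hashlib
from typing import Dict, List, Tuple


def _stable_hash(token: str, salt: str) -> int:
    """Deterministic 64-bit hash (the built-in hash() of str is salted per process)."""
    digest = hashlib.md5((salt + ":" + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def hash_features(features: Dict[str, float], m: int) -> Dict[int, float]:
    """Map a sparse {feature name: value} vector into m hashed dimensions.

    One hash picks the bucket and a second picks a sign, so colliding
    features tend to cancel rather than accumulate (an assumed variant).
    """
    hashed: Dict[int, float] = {}
    for name, value in features.items():
        index = _stable_hash(name, "index") % m
        sign = 1.0 if _stable_hash(name, "sign") % 2 == 0 else -1.0
        hashed[index] = hashed.get(index, 0.0) + sign * value
    return hashed


def squared_distance(x: Dict[int, float], centroid: List[float]) -> float:
    """||x - c||^2 for sparse x and dense c, expanded around the nonzeros of x."""
    dist = sum(c * c for c in centroid)
    for i, v in x.items():
        dist += v * v - 2.0 * v * centroid[i]
    return dist


def kmeans_hashed(
    docs: List[Dict[str, float]], k: int, m: int, iterations: int = 10
) -> Tuple[List[int], List[List[float]]]:
    """Lloyd-style K-means on hashed vectors; assumes len(docs) >= k."""
    points = [hash_features(d, m) for d in docs]

    # Illustrative initialization: the first k points become the centroids.
    centroids: List[List[float]] = []
    for p in points[:k]:
        c = [0.0] * m
        for i, v in p.items():
            c[i] = v
        centroids.append(c)

    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid in the hashed space.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if not members:
                continue  # keep the old centroid for an empty cluster
            mean = [0.0] * m
            for p in members:
                for i, v in p.items():
                    mean[i] += v / len(members)
            centroids[j] = mean
    return assignments, centroids
```

A call such as `kmeans_hashed(docs, k=2, m=64)` stores two centroids of 64 floats each, however many distinct feature names appear in `docs`. Smaller `m` saves memory but increases hash collisions, which is the source of the distortion in the objective that the abstract refers to.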