K-means clustering with feature hashing

Authors:
Hajime Senuma
Affiliations:
University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan
Venue:
HLT-SS '11 Proceedings of the ACL 2011 Student Session
Year:
2011

Citing 5
Cited 0

Database-friendly random projections: Johnson-Lindenstrauss with binary coins
Journal of Computer and System Sciences - Special issue on PODS 2001

k-means++: the advantages of careful seeding
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms

Introduction to Information Retrieval

Feature hashing for large scale multitask learning
ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning

Hash Kernels for Structured Data
The Journal of Machine Learning Research

Abstract

A major problem with K-means is that its centroids must be stored as dense vectors, so when the feature space is high-dimensional it becomes infeasible to hold them in memory. We address this issue with feature hashing (Weinberger et al., 2009), a dimension-reduction technique that shrinks dense vectors while retaining the sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to K-means by showing how much the K-means objective is (additively) distorted. Furthermore, to verify our method empirically, we ran experiments on a document clustering task.
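
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`hash_features`, `squared_distance`, `kmeans_hashed`), the use of MD5 as the hash function, the signed variant of the hashing trick, and the random initialization are all assumptions made for illustration. Sparse documents are hashed into m buckets and stay sparse, while each centroid is a dense vector of length m rather than of the original vocabulary size.

```python
import hashlib
import random


def _hash(token, salt):
    # Deterministic integer hash of a token (MD5 is an illustrative choice).
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(doc, m):
    """Map a sparse {token: weight} dict to a sparse {index: weight} dict
    whose indices lie in [0, m). A second hash picks a sign of +1 or -1."""
    hashed = {}
    for token, weight in doc.items():
        idx = _hash(token, "index:") % m
        sign = 1.0 if _hash(token, "sign:") % 2 == 0 else -1.0
        hashed[idx] = hashed.get(idx, 0.0) + sign * weight
    return hashed


def squared_distance(sparse, dense):
    """Squared Euclidean distance between a sparse point and a dense centroid,
    using ||x - c||^2 = ||c||^2 + sum over nonzero x_i of (x_i^2 - 2 x_i c_i)."""
    dist = sum(c * c for c in dense)
    for i, v in sparse.items():
        dist += v * v - 2.0 * v * dense[i]
    return dist


def kmeans_hashed(docs, k, m, iterations=10, seed=0):
    """Lloyd-style K-means on hashed documents; centroids are dense, length m."""
    points = [hash_features(d, m) for d in docs]
    rng = random.Random(seed)
    # Initialize centroids from k randomly chosen points (an assumption;
    # the source does not specify the seeding used).
    centroids = []
    for p in rng.sample(points, k):
        c = [0.0] * m
        for i, v in p.items():
            c[i] = v
        centroids.append(c)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if not members:
                continue  # keep the old centroid if a cluster is empty
            c = [0.0] * m
            for p in members:
                for i, v in p.items():
                    c[i] += v / len(members)
            centroids[j] = c
    return assignments, centroids
```

Calling `kmeans_hashed(docs, k=2, m=64)` on a list of token-weight dictionaries returns one cluster index per document and k dense centroids of length 64. Memory for the centroids scales with k times m, independent of the vocabulary size; the price is that colliding tokens share a bucket, which is the source of the distortion the abstract refers to.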