We deal with the problem of clustering data points. Given n points in a large set (for example, R^d) endowed with a distance function (for example, the L^2 distance), we would like to partition the data set into k disjoint clusters, each with a "cluster center", so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high-dimensional geometric settings, even for k=2. We give polynomial-time approximation schemes for this problem in several settings, including the binary cube {0,1}^d with Hamming distance, and R^d with either the L^1 distance, the L^2 distance, or the square of the L^2 distance. In all these settings, the best previous results were constant-factor approximation guarantees. We note that our problem is similar in flavor to the k-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed-dimensional geometric settings, where it becomes hard when k is part of the input. In contrast, we study the problem when k is fixed but the dimension is part of the input. Our algorithms are based on a dimension reduction construction for the Hamming cube, which may be of independent interest.
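To make the objective concrete, the following Python sketch spells out the clustering cost in the binary-cube setting: each point pays the Hamming distance to the nearest of k chosen centers, and the goal is to minimize the total. The brute-force search over all k-tuples of centers in {0,1}^d is only an illustration of the objective (it is exponential in d); it is not the paper's algorithm, whose approximation scheme rests on the dimension reduction construction mentioned above. The function names here are hypothetical.

```python
from itertools import product


def hamming(p, q):
    # Number of coordinates in which two binary vectors differ.
    return sum(a != b for a, b in zip(p, q))


def clustering_cost(points, centers):
    # Each point is assigned to its nearest center and pays the
    # Hamming distance to it; the objective is the sum over all points.
    return sum(min(hamming(p, c) for c in centers) for p in points)


def best_k_clustering(points, k):
    # Exhaustive search over all k-tuples of centers in {0,1}^d.
    # Exponential in d -- illustrative only; the paper gives a PTAS
    # for this objective rather than exact exponential-time search.
    d = len(points[0])
    cube = list(product((0, 1), repeat=d))
    best_cost, best_centers = None, None
    for centers in product(cube, repeat=k):
        cost = clustering_cost(points, centers)
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_cost, best_centers
```

For example, the four points (0,0,0), (0,0,1), (1,1,1), (1,1,0) split into two natural clusters, and the optimal k=2 cost is 2 (each cluster's center matches one of its points exactly and differs from the other in one coordinate).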