A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering

  • Authors: Shai Ben-David
  • Affiliations: School of Computer Science, University of Waterloo, Waterloo, Canada N2L 3G1
  • Venue: Machine Learning
  • Year: 2007

Abstract

We consider a framework of sample-based clustering. In this setting, the input to a clustering algorithm is a sample generated i.i.d. by some unknown, arbitrary distribution. Based on such a sample, the algorithm has to output a clustering of the full domain set, which is evaluated with respect to the underlying distribution. We provide general conditions on clustering problems that imply the existence of sampling-based clustering algorithms that approximate the optimal clustering. We show that K-median clustering, as well as the K-means and Vector Quantization problems, satisfies these conditions. Our results apply to the combinatorial optimization setting where, assuming that sampling uniformly over an input set can be done in constant time, we get a sampling-based algorithm for the K-median and K-means clustering problems that finds an almost optimal set of centers in time depending only on the confidence and accuracy parameters of the approximation, and independent of the input size. Furthermore, in the Euclidean input case, the dependence of the running time of our algorithm on the Euclidean dimension is only linear. Our main technical tool is a uniform convergence result for center-based clustering that can be viewed as showing that the effective VC-dimension of k-center clustering equals k.
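
To make the sample-based paradigm concrete, below is a minimal Python sketch (assuming NumPy) of this style of algorithm for K-means: it draws a uniform sample whose size depends only on k and on accuracy/confidence parameters, runs the standard Lloyd heuristic on that sample alone, and then evaluates the resulting centers on the full data set. The sample-size formula and the helper names (sample_based_kmeans, kmeans_cost) are illustrative placeholders, not the algorithm or bounds proved in the paper.

    import numpy as np

    def sample_based_kmeans(data, k, epsilon=0.1, delta=0.05, rng=None):
        """Cluster a small uniform sample, then reuse its centers for all of data.

        The sample size m below depends only on k, epsilon and delta (not on
        len(data)); it is a placeholder, not the bound from the paper.
        """
        rng = np.random.default_rng(rng)
        m = int(np.ceil(k * (1.0 / epsilon ** 2) * np.log(k / delta)))  # placeholder
        m = min(m, len(data))
        sample = data[rng.choice(len(data), size=m, replace=True)]

        # Standard Lloyd iterations, run on the sample only.
        centers = sample[rng.choice(m, size=k, replace=False)]
        for _ in range(50):
            dists = np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            new_centers = np.array([
                sample[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers

    def kmeans_cost(data, centers):
        """Average squared distance from each point to its nearest center."""
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        return float((dists.min(axis=1) ** 2).mean())

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        data = np.concatenate([rng.normal(loc=c, scale=0.3, size=(10_000, 2))
                               for c in ([0, 0], [4, 0], [0, 4])])
        centers = sample_based_kmeans(data, k=3, epsilon=0.1, delta=0.05, rng=1)
        print("cost of sample-based centers on the full set:", kmeans_cost(data, centers))

The point of the sketch is only that the clustering work is done on a sample of size independent of the input; the full data set is touched just once, to measure the cost of the returned centers.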