A fast k-means implementation using coresets

Authors:
Gereon Frahling;Christian Sohler
Affiliations:
University of Paderborn, Paderborn, Germany;University of Paderborn, Paderborn, Germany
Venue:
Proceedings of the twenty-second annual symposium on Computational geometry
Year:
2006

References: 22
Cited by: 10

Algorithms for clustering data

Algorithms for clustering data
Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract)

SCG '94 Proceedings of the tenth annual symposium on Computational geometry
Accelerating exact k-means algorithms with geometric reasoning

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Sublinear time approximate clustering

SODA '01 Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms
Approximate clustering via core-sets

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
A local search approximation algorithm for k-means clustering

Proceedings of the eighteenth annual symposium on Computational geometry
Projective clustering in high dimensions using core-sets

Proceedings of the eighteenth annual symposium on Computational geometry
Clustering Algorithms

Clustering Algorithms
BIRCH: A New Data Clustering Algorithm and Its Applications

Data Mining and Knowledge Discovery
An Efficient k-Means Clustering Algorithm: Analysis and Implementation

IEEE Transactions on Pattern Analysis and Machine Intelligence
Smaller core-sets for balls

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
X-means: Extending K-means with Efficient Estimation of the Number of Clusters

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Acceleration of K-Means and Related Clustering Algorithms

ALENEX '02 Revised Papers from the 4th International Workshop on Algorithm Engineering and Experiments
Approximation schemes for clustering problems

Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
Faster core-set constructions and data stream algorithms in fixed dimensions

SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
Clustering Motion

Discrete & Computational Geometry
On coresets for k-means and k-median clustering

STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Optimal Time Bounds for Approximate Clustering

Machine Learning
Approximating extent measures of points

Journal of the ACM (JACM)
A Simple Linear Time (1+ε)-Approximation Algorithm for k-Means Clustering in Any Dimensions

FOCS '04 Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science
Coresets in dynamic geometric data streams

Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
Smaller coresets for k-median and k-means clustering

SCG '05 Proceedings of the twenty-first annual symposium on Computational geometry

Special Section: Point-Based Graphics: Fast vector quantization for efficient rendering of compressed point-clouds

Computers and Graphics
Approximating largest convex hulls for imprecise points

Journal of Discrete Algorithms
Domain-specific sentiment analysis using contextual feature generation

Proceedings of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion
Approximating largest convex hulls for imprecise points

WAOA'07 Proceedings of the 5th international conference on Approximation and online algorithms
Automatic k-means for color enteromorpha image segmentation

IITA'09 Proceedings of the 3rd international conference on Intelligent information technology application
Dynamic decentralized mapping of tree-structured applications on NoC architectures

NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
The three steps of clustering in the post-genomic era: a synopsis

CIBB'10 Proceedings of the 7th international conference on Computational intelligence methods for bioinformatics and biostatistics
Dynamic k-means: a clustering technique for moving object trajectories

International Journal of Intelligent Information and Database Systems
k-means clustering on pre-calculated distance-based nearest neighbor search for image search

ACIIDS'13 Proceedings of the 5th Asian conference on Intelligent Information and Database Systems - Volume Part II
Learning Big (Image) Data via Coresets for Dictionaries

Journal of Mathematical Imaging and Vision


Abstract

In this paper we develop an efficient implementation of a k-means clustering algorithm. Our algorithm is a variant of KMHybrid [28, 20], i.e., it uses a combination of Lloyd-steps and random swaps, but as a novel feature it uses coresets to speed up the algorithm. A coreset is a small weighted set of points that approximates the original point set with respect to the considered problem. The main strength of the algorithm is that it can quickly determine clusterings of the same point set for many values of k. This is necessary in many applications, since one typically does not know a good value for k in advance. Once we have clusterings for many different values of k, we can determine a good choice of k using a quality measure of clusterings that is independent of k, for example the average silhouette coefficient. The average silhouette coefficient can be approximated using coresets.

To evaluate the performance of our algorithm, we compare it with algorithm KMHybrid [28] on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. We show that our algorithm significantly outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm deviates less than that of KMHybrid.

We also computed clusterings and approximate average silhouette coefficients for k=1,…,100 for our input instances and discuss the performance of our algorithm in detail.
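
The idea of running Lloyd steps on a small weighted point set instead of the full input can be illustrated with a minimal Python sketch. It is not the authors' implementation: the uniform-grid construction in `grid_coreset` is a simple stand-in for the paper's coreset construction, the random-swap phase is omitted, and all function names and the `cell` parameter are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def grid_coreset(points, cell):
    """Stand-in coreset (assumption, not the paper's construction):
    one weighted centroid per occupied cell of a uniform grid."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell) for x in p)
        buckets[key].append(p)
    coreset = []
    for members in buckets.values():
        n = len(members)
        centroid = tuple(sum(coords) / n for coords in zip(*members))
        coreset.append((centroid, n))  # weight = number of points represented
    return coreset


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_cost(coreset, centers):
    """k-means cost of the weighted point set for the given centers."""
    return sum(w * min(sq_dist(p, c) for c in centers) for p, w in coreset)


def weighted_lloyd_step(coreset, centers):
    """One Lloyd step on weighted points: assign each point to its nearest
    center, then move every center to the weighted mean of its cluster."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    mass = [0.0] * len(centers)
    for p, w in coreset:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        mass[j] += w
        for d in range(dim):
            sums[j][d] += w * p[d]
    new_centers = []
    for j, c in enumerate(centers):
        if mass[j] > 0:
            new_centers.append(tuple(s / mass[j] for s in sums[j]))
        else:
            new_centers.append(c)  # empty cluster: keep the old center
    return new_centers


# Small demonstration on two well-separated groups of 2D points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
coreset = grid_coreset(points, cell=1.0)
centers = [(1.0, 1.0), (4.0, 4.0)]
print("cost before:", weighted_cost(coreset, centers))
centers = weighted_lloyd_step(coreset, centers)
print("cost after: ", weighted_cost(coreset, centers))
```

Because each Lloyd step touches only the weighted representatives, its running time depends on the coreset size rather than on the number of input points, which is what makes repeating the computation for many values of k affordable. The same weighted set could be reused for each k, under the assumption that its approximation quality is adequate for all of them.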