Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. Our model assumes access to one-versus-all queries, which, given a point s ∈ S, return the distances between s and all other points. We show that, given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting fits our model well because a fast sequence database search program can query a sequence against an entire data set. We conduct an empirical study showing that, even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
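The landmark idea can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: it assumes farthest-first traversal as the active selection strategy and nearest-landmark assignment as the clustering step, and the names `one_versus_all` and `landmark_clustering` are hypothetical. It issues exactly k one-versus-all queries, one per landmark, and never looks at any other distance.

```python
import random


def one_versus_all(points, dist, s):
    """Hypothetical one-versus-all query: distances from points[s] to every point."""
    return [dist(points[s], p) for p in points]


def landmark_clustering(points, dist, k, seed=0):
    """Pick k landmarks by farthest-first traversal (an assumption, not the
    paper's procedure), then assign each point to its nearest landmark.
    Requires 1 <= k <= len(points)."""
    rng = random.Random(seed)
    n = len(points)

    # The first landmark is chosen at random; one query gives its distance row.
    first = rng.randrange(n)
    landmarks = [first]
    rows = [one_versus_all(points, dist, first)]

    # min_dist[i] is the distance from point i to its nearest landmark so far.
    min_dist = list(rows[0])

    while len(landmarks) < k:
        # Active selection: the next landmark is the point farthest from
        # all landmarks chosen so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        landmarks.append(nxt)
        row = one_versus_all(points, dist, nxt)
        rows.append(row)
        min_dist = [min(a, b) for a, b in zip(min_dist, row)]

    # Cluster using only landmark-to-point distances.
    labels = [min(range(k), key=lambda j: rows[j][i]) for i in range(n)]
    return landmarks, labels


# Toy usage: six points on a line forming two well-separated groups.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
landmarks, labels = landmark_clustering(points, lambda a, b: abs(a - b), k=2)
```

In the protein setting described above, `one_versus_all` would be replaced by a single run of the sequence database search program against the whole data set, so the cost of the sketch is dominated by the k searches rather than by the number of point pairs.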