Active clustering of biological sequences

Authors:
Konstantin Voevodski;Maria-Florina Balcan;Heiko Röglin;Shang-Hua Teng;Yu Xia
Affiliations:
Google, New York, NY;College of Computing, Georgia Institute of Technology, Atlanta, GA;Department of Computer Science, University of Bonn, Bonn, Germany;Computer Science Department, University of Southern California, Los Angeles, CA;Bioinformatics Program and Department of Chemistry, Boston University, Boston, MA
Venue:
The Journal of Machine Learning Research
Year:
2012

Abstract

Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume access to one-versus-all queries that, given a point s ∈ S, return the distances between s and all other points. We show that, under a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting fits our model well because a fast sequence database search program can query a sequence against an entire data set. We conduct an empirical study showing that, even though we query only a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.
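
The query model described in the abstract can be illustrated with a minimal sketch. It is not the algorithm from the paper: the abstract does not spell out the active selection strategy or the clustering step, so a plain farthest-first traversal and a nearest-landmark assignment are swapped in as stand-ins. The names `select_landmarks`, `assign_to_nearest_landmark`, and the `query` callback are hypothetical; `query(s)` is assumed to return the distances from point `s` to every point, which is the one-versus-all query.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical type for a one-versus-all query: index of s -> distances to all points.
Query = Callable[[int], Sequence[float]]


def select_landmarks(
    n: int, query: Query, num_landmarks: int, seed: int = 0
) -> Tuple[List[int], List[Sequence[float]]]:
    """Pick landmarks by farthest-first traversal (a stand-in, not the paper's rule).

    Issues exactly one one-versus-all query per landmark.
    """
    rng = random.Random(seed)
    first = rng.randrange(n)
    landmarks = [first]
    rows = [query(first)]
    # min_dist[i] = distance from point i to its closest landmark so far.
    min_dist = list(rows[0])
    while len(landmarks) < num_landmarks:
        nxt = max(range(n), key=lambda i: min_dist[i])
        row = query(nxt)
        landmarks.append(nxt)
        rows.append(row)
        min_dist = [min(a, b) for a, b in zip(min_dist, row)]
    return landmarks, rows


def assign_to_nearest_landmark(rows: List[Sequence[float]]) -> List[int]:
    """Label each point with the position of its nearest landmark.

    Uses only landmark-to-point distances; no other distances are needed.
    """
    n = len(rows[0])
    return [min(range(len(rows)), key=lambda j: rows[j][i]) for i in range(n)]


if __name__ == "__main__":
    # Toy data: two well-separated groups on a line (an assumption for illustration).
    points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]

    def toy_query(s: int) -> List[float]:
        return [abs(points[s] - p) for p in points]

    landmarks, rows = select_landmarks(len(points), toy_query, num_landmarks=2)
    print(landmarks, assign_to_nearest_landmark(rows))
```

In the protein setting, `query` would wrap a sequence database search of one sequence against the whole data set; the sketch touches only `num_landmarks` rows of the distance matrix, which is the point of the model.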