Active clustering of biological sequences

  • Authors:
  • Konstantin Voevodski;Maria-Florina Balcan;Heiko Röglin;Shang-Hua Teng;Yu Xia

  • Affiliations:
  • Google, New York, NY;College of Computing, Georgia Institute of Technology, Atlanta, GA;Department of Computer Science, University of Bonn, Bonn, Germany;Computer Science Department, University of Southern California, Los Angeles, CA;Bioinformatics Program and Department of Chemistry, Boston University, Boston, MA

  • Venue:
  • The Journal of Machine Learning Research
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.