Cluster Utility: A New Metric for Clustering Biological Sequences

  • Authors: Jason Lee; Sun Kim
  • Affiliations: School of Informatics, Indiana University; Center for Genomics and Bioinformatics, Indiana University
  • Venue: CSBW '05 Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference - Workshops
  • Year: 2005

Abstract

The sequence clustering problem differs from traditional clustering problems in that the features of sequences are not observable and sequences cannot be placed in a metric space, which most clustering algorithms assume. The most widely used approach is to build a sequence graph from all-pairwise sequence comparison data and to use that graph to generate clusters of sequences. As with other clustering problems, a metric is needed to evaluate the results of a sequence clustering algorithm, but metrics for traditional clustering problems are not readily applicable because they assume a metric space. We propose Cluster Utility (CU), a metric based on similarity within a cluster and difference between clusters that requires no metric space assumption. CU showed a very high correlation with the quality index. CU scales very well with data size, and its strong correlation with the quality index remained nearly unchanged as the data size varied. CU can be used in two ways: to guide sequence clustering algorithms and to evaluate clustering results.
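
The graph-based approach described in the abstract can be illustrated with a minimal Python sketch. It assumes pairwise similarity scores are already available as a dictionary keyed by ordered pairs of sequence identifiers, links two sequences when their score meets a threshold, and takes connected components as clusters. The function names, the input format, and the threshold are all hypothetical. The second function only contrasts mean within-cluster and between-cluster similarity to show the general idea of scoring a clustering without a metric space; it is not the CU formula, which the abstract does not state.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_threshold(scores, threshold):
    """Cluster sequences as connected components of a similarity graph.

    scores: dict mapping (a, b) with a < b to a pairwise similarity
            (hypothetical input format, e.g. derived from alignment scores).
    threshold: minimum similarity for an edge (assumed parameter).
    """
    # Collect every sequence id that appears in the pairwise data.
    nodes = sorted({s for pair in scores for s in pair})
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add an edge (merge components) for each sufficiently similar pair.
    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


def within_between_means(scores, clusters):
    """Return (mean within-cluster, mean between-cluster) similarity.

    Illustrative contrast only; NOT the paper's Cluster Utility definition.
    Pairs missing from `scores` are treated as similarity 0.0 (assumption).
    """
    label = {s: i for i, c in enumerate(clusters) for s in c}
    within, between = [], []
    for a, b in combinations(sorted(label), 2):
        score = scores.get((a, b), 0.0)
        (within if label[a] == label[b] else between).append(score)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(within), mean(between)
```

Connected components are the simplest way to turn a thresholded sequence graph into clusters; any other graph clustering method could be substituted without changing the scoring function.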