Brief Communication: Fast embedding methods for clustering tens of thousands of sequences

Authors:
Gordon Blackshields;Mark Larkin;Iain M. Wallace;Andreas Wilm;Desmond G. Higgins
Affiliations:
UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland (all authors)
Venue:
Computational Biology and Chemistry
Year:
2008

References:
FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. SIGMOD '95: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data.
Properties of Embedding Methods for Similarity Searching in Metric Spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence.
PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics.

Abstract

Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences, which takes computer memory and time proportional to N^2 for N sequences. For small N, say up to about 10,000, this can be accomplished in reasonable time for sequences of moderate length. For very large N, however, it becomes increasingly prohibitive. In this paper, we test variations on a class of published embedding methods that were designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods embed the sequences in a space where the similarities within a set of sequences can be closely approximated without computing all pairwise distances. We show that this approach greatly reduces the computation time and memory required to cluster large numbers of sequences, and we demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignments. Source code is available on request from the authors.
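
The general idea of embedding can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions: the reference sequences are chosen at random, a simple k-mer Jaccard distance stands in for the expensive sequence distance, and the names `kmer_distance`, `embed` and `embedded_distance` are hypothetical. Each sequence is mapped to a vector of its distances to a small set of reference sequences, so only N x k expensive distances are computed instead of N(N-1)/2.

```python
import math
import random


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """Jaccard distance between the k-mer sets of two sequences.

    Assumption: this cheap measure is only a stand-in for whatever
    expensive sequence distance is used in practice.
    """
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 0.0
    return 1.0 - len(kmers_a & kmers_b) / len(union)


def embed(seqs: list, n_refs: int, seed: int = 0) -> list:
    """Map each sequence to its vector of distances to n_refs references.

    Assumption: references are picked uniformly at random; published
    embedding methods choose them more carefully.
    """
    rng = random.Random(seed)
    refs = rng.sample(seqs, n_refs)
    # N * n_refs distance computations in total.
    return [[kmer_distance(s, r) for r in refs] for s in seqs]


def embedded_distance(u: list, v: list) -> float:
    """Euclidean distance between two embedded vectors (cheap to compute)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGTTTTT", "GGGGCCCC"]
    vectors = embed(seqs, n_refs=2)
    # Approximate similarity of the first two sequences from their vectors.
    print(embedded_distance(vectors[0], vectors[1]))
```

The embedded vectors can then be passed to any standard clustering method. As a rough illustration of the saving, with hypothetical values N = 10,000 and k = 100, the embedding needs 1,000,000 sequence distance computations, compared with 49,995,000 for the full matrix.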