High Similarity Sequence Comparison in Clustering Large Sequence Databases

Authors:
Lorie Dudoignon;Eric Glemet;Hendrik Cornelis Heus;Mathieu Raffinot
Venue:
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Year:
2002

Citing (7):

Practical parallel union-find algorithms for transitive closure and clustering
International Journal of Parallel Programming

The distribution of subword counts is usually normal
European Journal of Combinatorics

q-gram based database searching using a suffix array (QUASAR)
RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology

Efficiency of a Good But Not Linear Set Union Algorithm
Journal of the ACM (JACM)

Faster algorithms for string matching with k mismatches
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms

Fast and simple character classes and bounded gaps pattern matching, with application to protein searching
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology

Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts
COM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching

Cited by (1):

A multimedia data base browsing system
Proceedings of the 1st international workshop on Computer vision meets databases

Abstract

We present a fast algorithm for sequence clustering and searching that works with large sequence databases. It uses a strictly defined similarity measure. The algorithm is faster than conventional EST clustering approaches because its complexity is directly related to the number of subwords shared by the sequences. Furthermore, the algorithm also works with protein sequences and with very long sequences such as entire chromosomes. We present a theoretical study of our approach and provide experimental results.
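
The abstract does not spell out the algorithm or its similarity measure, so the following is only an illustrative sketch of the general idea of clustering by shared subwords. It is not the authors' method. It assumes a hypothetical similarity rule (two sequences are linked when they share at least min_shared distinct subwords of length q) and merges linked sequences with a simple union-find structure. The names qgrams, UnionFind, cluster_by_shared_subwords, q, and min_shared are all chosen here for illustration.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(seq, q):
    """Return the set of distinct length-q subwords of seq."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


class UnionFind:
    """Minimal disjoint-set structure with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_by_shared_subwords(seqs, q=4, min_shared=3):
    """Cluster sequences that share at least min_shared distinct q-grams.

    The threshold rule is an assumption made for this sketch, not the
    similarity measure defined in the paper.
    """
    # Inverted index: subword -> ids of the sequences that contain it.
    index = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for word in qgrams(seq, q):
            index[word].append(sid)

    # Count shared subwords per pair; the work done here grows with the
    # number of subwords the sequences actually have in common.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(ids, 2):
            shared[(a, b)] += 1

    # Merge every pair that passes the threshold (transitive closure).
    uf = UnionFind(len(seqs))
    for (a, b), count in shared.items():
        if count >= min_shared:
            uf.union(a, b)

    clusters = defaultdict(list)
    for sid in range(len(seqs)):
        clusters[uf.find(sid)].append(sid)
    return sorted(clusters.values())


example = ["ACGTACGTAA", "CGTACGTAAC", "TTTTGGGGCC", "TTTGGGGCCA"]
print(cluster_by_shared_subwords(example))
```

In this sketch, pairs of sequences with no common subword are never compared at all, which is one plausible reading of why the cost would depend on the number of shared subwords rather than on the number of sequence pairs. A subword shared by very many sequences would still generate many pairs, so a practical implementation would need to handle such frequent subwords separately.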