An efficient implementation of the d2 distance function for EST clustering: preliminary investigations

Authors:
Scott Hazelhurst
Affiliations:
School of Computer Science, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa
Venue:
SAICSIT '04 Proceedings of the 2004 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
Year:
2004

Abstract

The d2 distance function is commonly used in the clustering of DNA sequences such as expressed sequence tags (ESTs), an important biological application. The use of d2 allows approximate string matching to be performed with a good balance between selectivity and sensitivity. The computational challenges of EST clustering make efficient evaluation of the d2 function imperative. The paper presents a new incremental algorithm that requires an amortised cost of O(m) per evaluation on realistic data sets (where m is the average length of an EST). In addition, two filtering heuristics are presented that improve clustering performance by estimating upper bounds on the d2 scores.
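
The abstract does not define d2 itself. As a point of reference only, the sketch below assumes the definition commonly used in the EST clustering literature: the d2 score of two sequences is the minimum, over all pairs of fixed-length windows (one from each sequence), of the sum of squared differences in word (k-mer) counts. The function names, the default word length of 6, and the default window length of 100 are illustrative assumptions, not values taken from this record. This is the naive brute-force evaluation, not the paper's incremental algorithm or its filtering heuristics.

```python
from collections import Counter


def word_counts(s, k):
    """Count every overlapping word (k-mer) of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def d2_window(u, v, k):
    """Sum of squared differences in k-mer counts between two windows."""
    cu, cv = word_counts(u, k), word_counts(v, k)
    # Counter returns 0 for words that are absent from one window.
    return sum((cu[w] - cv[w]) ** 2 for w in cu.keys() | cv.keys())


def d2_naive(x, y, k=6, window=100):
    """Brute-force d2: minimum window score over all window pairs.

    Assumed definition (not stated in the abstract). Windows are
    truncated to the sequence length when a sequence is shorter
    than the requested window.
    """
    wx, wy = min(window, len(x)), min(window, len(y))
    best = None
    for i in range(len(x) - wx + 1):
        for j in range(len(y) - wy + 1):
            score = d2_window(x[i:i + wx], y[j:j + wy], k)
            if best is None or score < best:
                best = score
    return best
```

Each window pair is re-counted from scratch here, so the cost grows with the product of the sequence lengths times the window length. That redundancy is the kind of overhead an incremental evaluation would aim to avoid, since adjacent windows differ by only one word entering and one word leaving.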