The d2 distance function is commonly used in clustering DNA sequences such as expressed sequence tags (ESTs), an important biological application. Using d2 allows approximate string matching with a good balance between selectivity and sensitivity. The computational demands of EST clustering make efficient evaluation of the d2 function imperative. This paper presents a new incremental algorithm with an amortised cost of O(m) per evaluation on realistic data sets, where m is the average length of an EST. In addition, two filtering heuristics are presented that improve clustering performance by estimating upper bounds on the d2 scores.
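To make the idea of incremental evaluation concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes the commonly used definition of d2, which the abstract does not spell out: the minimum, over all pairs of length-w windows (one from each sequence), of the sum of squared differences in k-word counts. The function names and the parameters k (word length) and w (window length) are illustrative choices. The sketch shows only the basic observation that sliding a window by one position changes two word counts, so the score can be updated in constant time rather than recomputed.

```python
from collections import Counter


def word_counts(s, k):
    """Count every length-k word (k-mer) occurring in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def d2_window(u, v, k):
    """Sum of squared differences between the k-word counts of u and v."""
    cu, cv = word_counts(u, k), word_counts(v, k)
    return sum((cu[word] - cv[word]) ** 2 for word in cu.keys() | cv.keys())


def d2_naive(x, y, k, w):
    """Minimum window score, recomputing every window pair from scratch.

    Assumes k <= w <= len(x) and w <= len(y) (illustrative precondition).
    """
    return min(
        d2_window(x[i:i + w], y[j:j + w], k)
        for i in range(len(x) - w + 1)
        for j in range(len(y) - w + 1)
    )


def d2_sliding(x, y, k, w):
    """Same result as d2_naive, updating the score as the y-window slides."""
    best = None
    for i in range(len(x) - w + 1):
        # diff[word] = (count in x-window) - (count in current y-window)
        diff = word_counts(x[i:i + w], k)
        diff.subtract(word_counts(y[:w], k))
        score = sum(d * d for d in diff.values())
        best = score if best is None else min(best, score)
        for j in range(1, len(y) - w + 1):
            out_word = y[j - 1:j - 1 + k]   # word leaving the y-window
            in_word = y[j + w - k:j + w]    # word entering the y-window
            # Removing a word from y raises diff by 1: (d+1)^2 - d^2 = 2d + 1
            score += 2 * diff[out_word] + 1
            diff[out_word] += 1
            # Adding a word to y lowers diff by 1: (d-1)^2 - d^2 = -2d + 1
            score += -2 * diff[in_word] + 1
            diff[in_word] -= 1
            best = min(best, score)
    return best
```

Under these assumptions, `d2_sliding` still visits every window pair, so it does not by itself reach the O(m) amortised cost reported above; it only removes the per-window recount that `d2_naive` performs.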