An efficient implementation of the d2 distance function for EST clustering: preliminary investigations

  • Authors:
  • Scott Hazelhurst

  • Affiliations:
  • School of Computer Science, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa

  • Venue:
  • SAICSIT '04 Proceedings of the 2004 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
  • Year:
  • 2004

Abstract

The d2 distance function is commonly used to cluster DNA sequences such as expressed sequence tags (ESTs), an important biological application. Using d2 allows approximate string matching to be performed with a good balance between selectivity and sensitivity. Because EST clustering is computationally demanding, efficient evaluation of the d2 function is imperative. The paper presents a new incremental algorithm that requires an amortised cost of O(m) per evaluation on realistic data sets (where m is the average length of an EST). In addition, two filtering heuristics are presented that improve clustering performance by estimating upper bounds on the d2 scores.
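
For readers unfamiliar with d2, the following is a minimal sketch of the commonly used definition, not the incremental algorithm the paper presents. It assumes d2 is the minimum, over all pairs of fixed-length windows drawn from the two sequences, of the sum of squared differences in k-letter word counts. The function names (`word_counts`, `d2_window`, `d2`) and the default word length and window size are illustrative assumptions, not values taken from the abstract.

```python
from collections import Counter


def word_counts(s, k):
    """Count every overlapping k-letter word in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def d2_window(u, v, k):
    """Sum of squared differences in k-word counts between two windows."""
    cu, cv = word_counts(u, k), word_counts(v, k)
    # A Counter returns 0 for words it has not seen.
    return sum((cu[w] - cv[w]) ** 2 for w in cu.keys() | cv.keys())


def d2(x, y, k=6, window=100):
    """Naive d2: minimum window-pair score over all window placements.

    k and window are assumed defaults for illustration only. Every window
    pair is recomputed from scratch, so this is far more expensive than
    the amortised O(m) per evaluation that the abstract reports.
    """
    wx, wy = min(window, len(x)), min(window, len(y))
    best = None
    for i in range(len(x) - wx + 1):
        for j in range(len(y) - wy + 1):
            score = d2_window(x[i:i + wx], y[j:j + wy], k)
            if best is None or score < best:
                best = score
    return best
```

A score of 0 means some window of one sequence has exactly the same word counts as some window of the other, for example `d2("TTTTACGTAC", "ACGTACGGGG", k=3, window=6)`, where both sequences contain the window `ACGTAC`. An incremental approach would avoid recounting words when a window slides by one position, since only one word leaves the window and one enters it.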