Approximate String Matching in DNA Sequences

  • Authors:
  • Lok-Lam Cheng;David W. Cheung;Siu-Ming Yiu

  • Venue:
  • DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
  • Year:
  • 2003

Abstract

Approximate string matching on large DNA sequence data is very important in bioinformatics. Some studies have shown that the suffix tree is an efficient data structure for approximate string matching. It performs better than the suffix array if the data structure can be stored entirely in memory. However, our study finds that the suffix array is much better than the suffix tree for indexing DNA sequences, since the data structure has to be created and stored on disk due to its size. We propose a novel auxiliary data structure which greatly improves the efficiency of the suffix array in the approximate string matching problem in the external memory model. The second problem we have tackled is parallel approximate matching in DNA sequences. We propose two novel parallel algorithms for this problem and implement them on a PC cluster. The results show that when the error allowed is small, a direct partitioning of the array over the machines in the cluster is the more efficient approach. On the other hand, when the error allowed is large, partitioning the data over the machines is the better approach.
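
For readers unfamiliar with suffix-array search, the following is a minimal in-memory sketch of approximate matching over a suffix array. It is not the auxiliary data structure or the external-memory algorithm proposed in the paper, whose details the abstract does not give. It assumes mismatches only (Hamming distance, no insertions or deletions) and uses the standard pigeonhole filter: a pattern that matches with at most k mismatches, when cut into k + 1 pieces, must contain at least one piece that matches exactly. The function names `build_suffix_array`, `exact_occurrences`, and `approximate_matches` are illustrative choices, and the naive construction is suitable only for short strings.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Illustration only; real DNA indexes need a linear-time or disk-based build.
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_occurrences(text, sa, piece):
    # Two binary searches over the sorted suffixes find the block of
    # suffixes whose first len(piece) characters equal `piece`.
    m = len(piece)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= piece
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < piece:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > piece
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= piece:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def approximate_matches(text, sa, pattern, k):
    # Assumption: "error" means mismatches (Hamming distance), not edit distance.
    m = len(pattern)
    pieces = k + 1
    bounds = [i * m // pieces for i in range(pieces + 1)]
    hits = set()
    for a, b in zip(bounds, bounds[1:]):
        # Seed with exact hits of one piece, then verify the whole window.
        for pos in exact_occurrences(text, sa, pattern[a:b]):
            start = pos - a
            if start < 0 or start + m > len(text) or start in hits:
                continue
            window = text[start:start + m]
            mismatches = sum(1 for x, y in zip(window, pattern) if x != y)
            if mismatches <= k:
                hits.add(start)
    return sorted(hits)


text = "ACGTACGTTACGAACGT"
sa = build_suffix_array(text)
print(approximate_matches(text, sa, "ACGA", 1))
```

The sketch also hints at why the allowed error matters for performance: as k grows, the pieces get shorter, each exact lookup returns more candidates, and verification dominates the cost.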