Approximate String Matching in DNA Sequences

Authors:
Lok-Lam Cheng;David W. Cheung;Siu-Ming Yiu
Venue:
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
Year:
2003

Abstract

Approximate string matching on large DNA sequence data is very important in bioinformatics. Some studies have shown that the suffix tree is an efficient data structure for approximate string matching. It performs better than the suffix array if the data structure can be stored entirely in memory. However, our study finds that the suffix array is much better than the suffix tree for indexing DNA sequences, since the data structure has to be created and stored on disk due to its size. We propose a novel auxiliary data structure which greatly improves the efficiency of the suffix array in the approximate string matching problem in the external memory model. The second problem we have tackled is parallel approximate matching in DNA sequences. We propose two novel parallel algorithms for this problem and implement them on a PC cluster. The results show that when the error allowed is small, a direct partitioning of the array over the machines in the cluster is a more efficient approach. On the other hand, when the error allowed is large, partitioning the data over the machines is a better approach.
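
To make the idea of approximate matching over a suffix array concrete, the following is a minimal in-memory sketch. It is not the auxiliary data structure or the external-memory method proposed in the paper, whose details the abstract does not give. It assumes that "error" means mismatches only (Hamming distance), and it uses the standard pigeonhole filter: a pattern with at most k mismatches, split into k + 1 pieces, must contain at least one piece that matches exactly. All function names are hypothetical and chosen for illustration.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_occurrences(text, sa, piece):
    # Two binary searches locate the block of suffixes whose first
    # len(piece) characters equal `piece`.
    m = len(piece)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= piece
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < piece:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > piece
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= piece:
            lo = mid + 1
        else:
            hi = mid
    return [sa[i] for i in range(start, lo)]


def k_mismatch_search(text, pattern, k):
    # Assumption: errors are substitutions only (Hamming distance).
    m = len(pattern)
    if k < 0 or m <= k:
        raise ValueError("need 0 <= k < len(pattern)")
    sa = build_suffix_array(text)
    pieces = k + 1
    hits = set()
    for j in range(pieces):
        a = j * m // pieces          # piece j covers pattern[a:b]
        b = (j + 1) * m // pieces
        for pos in exact_occurrences(text, sa, pattern[a:b]):
            start = pos - a          # where the full pattern would begin
            if start < 0 or start + m > len(text):
                continue
            window = text[start:start + m]
            # Verify the candidate by counting mismatches directly.
            if sum(x != y for x, y in zip(window, pattern)) <= k:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    print(k_mismatch_search("ACGTACGTTACGA", "ACGA", 1))
```

As k grows, the exact pieces get shorter, so the filter produces more candidates to verify. This gives some intuition for why the best strategy can depend on the error allowed, although the sketch makes no claim about the paper's measured results.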