Scalable string similarity search/join with approximate seeds and multiple backtracking

Authors:
Enrico Siragusa;David Weese;Knut Reinert
Affiliations:
Freie Universität Berlin, Berlin, Germany;Freie Universität Berlin, Berlin, Germany;Freie Universität Berlin, Berlin, Germany
Venue:
Proceedings of the Joint EDBT/ICDT 2013 Workshops
Year:
2013

Cited by:
0
References (7):

Algorithms on strings, trees, and sequences: computer science and computational biology
A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM)
A Fast Algorithm on Average for All-Against-All Sequence Matching. SPIRE '99 Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware
Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms - SPIRE 2002
Better external memory suffix array construction. Journal of Experimental Algorithmics (JEA)
Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics
RazerS 3. Bioinformatics

Abstract

In this paper, we present scalable algorithms for optimal string similarity search and join. Our methods are variations of those applied in Masai [15], our recently published tool for mapping high-throughput DNA sequencing data with unprecedented speed and accuracy. The key features of our approach are filtration with approximate seeds and methods for multiple backtracking. Compared to exact seeds, approximate seeds increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds. Combined, these two methods significantly speed up string similarity search and join operations. Our tool is implemented in C++ with OpenMP using the SeqAn library. The source code is distributed under the BSD license and can be freely downloaded from http://www.seqan.de/projects/edbt2013.
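
The idea behind seed filtration can be illustrated with the pigeonhole principle: if a pattern matches a text with at most k errors and is split into s non-overlapping seeds, at least one seed matches with at most floor(k/s) errors. Exact seeds correspond to s = k + 1 (zero errors per seed); choosing fewer, longer seeds that tolerate a few errors is one way to obtain approximate seeds. The following is a minimal sketch under that assumption, not the authors' implementation: the function names `splitSeeds`, `seedErrors`, and `editDistance` are hypothetical and do not come from SeqAn or Masai.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `pattern` into `s` non-overlapping seeds of
// near-equal length. Assumes 1 <= s <= pattern.size().
std::vector<std::string> splitSeeds(const std::string& pattern, std::size_t s) {
    std::vector<std::string> seeds;
    const std::size_t base = pattern.size() / s;
    const std::size_t extra = pattern.size() % s;  // first `extra` seeds get one more character
    std::size_t pos = 0;
    for (std::size_t i = 0; i < s; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        seeds.push_back(pattern.substr(pos, len));
        pos += len;
    }
    return seeds;
}

// Pigeonhole bound: with k errors spread over s seeds, at least one seed
// contains at most floor(k / s) errors. Assumes s >= 1.
unsigned seedErrors(unsigned k, unsigned s) {
    return k / s;
}

// Plain dynamic-programming edit distance (two rolling rows). A filter would
// use something like this only to verify candidates reported by the seeds.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            cur[j] = std::min({substitution, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

For example, a pattern of length 100 searched with k = 5 errors needs six exact seeds of about 16 characters each, whereas `splitSeeds(pattern, 3)` with `seedErrors(5, 3) == 1` yields three seeds of about 33 characters that each tolerate one error. Longer seeds tend to produce fewer spurious candidates, which is the specificity gain the abstract refers to.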