Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing
Algorithms on strings, trees, and sequences: computer science and computational biology
A fast bit-vector algorithm for approximate string matching based on dynamic programming
Journal of the ACM (JACM)
A guided tour to approximate string matching
ACM Computing Surveys (CSUR)
Dictionary matching and indexing with errors and don't cares
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Journal of the ACM (JACM)
The fragment assembly string graph
Bioinformatics
ACM Computing Surveys (CSUR)
Compressed representations of sequences and full-text indexes
ACM Transactions on Algorithms (TALG)
Compressed indexing and local alignment of DNA
Bioinformatics
Dynamic entropy-compressed sequences and full-text indexes
ACM Transactions on Algorithms (TALG)
Linear pattern matching algorithms
SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Bioinformatics
Information Processing Letters
Approximate all-pairs suffix/prefix overlaps
CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Unified view of backward backtracking in short read mapping
Algorithms and Applications
Least random suffix/prefix matches in output-sensitive time
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of strings of total length $n$ and an error rate $\varepsilon$, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance $k = \lceil \varepsilon \ell \rceil$, where $\ell$ is the length of the overlap. We propose a new solution for this problem based on backward backtracking (Lam, et al., 2008) and suffix filters (Kärkkäinen and Na, 2008). Our technique uses $nH_k + o(n \log \sigma) + r \log r$ bits of space, where $H_k$ is the $k$-th order entropy and $\sigma$ the alphabet size. In practice, it scales better in space than q-gram filters (Rasmussen, et al., 2006) and is comparable to them in time. Our method is also easy to parallelize and scales up to millions of DNA reads.
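
To make the problem statement concrete, the following is a minimal brute-force sketch for a single pair of strings. It is not the method proposed above: it uses a plain quadratic dynamic program rather than backward backtracking or suffix filters, and it only illustrates which overlaps qualify under the threshold of ceil(error rate times overlap length). The function name `approx_overlaps`, the `min_len` parameter, and the choice to measure the overlap length as the length of the prefix of the second string are assumptions made for this illustration.

```python
import math
from fractions import Fraction


def approx_overlaps(a, b, eps, min_len=1):
    """Return (ell, dist) pairs where the prefix b[:ell] is within
    ceil(eps * ell) edits of some suffix of a (hypothetical helper)."""
    err = Fraction(str(eps))  # exact arithmetic avoids float rounding inside ceil
    n = len(a)
    # prev[i]: min edit distance between b[:j] and any substring of a ending at i.
    # For j = 0 the empty prefix matches an empty substring at zero cost.
    prev = [0] * (n + 1)
    hits = []
    for j in range(1, len(b) + 1):
        cur = [j] + [0] * n  # b[:j] against the empty string costs j edits
        for i in range(1, n + 1):
            sub = prev[i - 1] + (a[i - 1] != b[j - 1])  # match or substitution
            cur[i] = min(sub, prev[i] + 1, cur[i - 1] + 1)  # or a gap on either side
        prev = cur
        # A substring of a ending at position n is a suffix of a.
        k = math.ceil(err * j)  # assumption: ell is the prefix length j
        if j >= min_len and cur[n] <= k:
            hits.append((j, cur[n]))
    return hits


if __name__ == "__main__":
    print(approx_overlaps("GATTACA", "TACAGG", 0, min_len=3))     # exact overlaps only
    print(approx_overlaps("GATTACA", "TACAGG", 0.25, min_len=3))  # up to 25% errors
```

Running this for every ordered pair of reads costs time proportional to the product of the two lengths per pair, which is the kind of cost that index-based and filtering approaches aim to avoid on large read sets.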