Approximate string matching using compressed suffix arrays

Authors:
Trinh N. D. Huynh;Wing-Kai Hon;Tak-Wah Lam;Wing-Kin Sung
Affiliations:
School of Computing, National University of Singapore, Singapore;Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong;Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong;School of Computing, National University of Singapore, Singapore
Venue:
Theoretical Computer Science
Year:
2006



Abstract

Let T be a text of length n and P be a pattern of length m, both strings over a fixed finite alphabet A. The k-difference (respectively, k-mismatch) problem is to find all occurrences of P in T whose edit distance (respectively, Hamming distance) from P is at most k. In this paper we investigate the well-studied case in which T is fixed and preprocessed into an indexing data structure so that any pattern query can be answered faster. We give a solution that uses an $O(n \log n)$-bit indexing data structure and achieves $O(|A|^k m^k \cdot \max(k, \log n) + occ)$ query time, where occ is the number of occurrences. The best previous result requires an $O(n \log n)$-bit indexing data structure and gives $O(|A|^k m^{k+2} + occ)$ query time. Our solution also allows us to exploit compressed suffix arrays to reduce the indexing space to $O(n)$ bits, while increasing the query time by only an $O(\log n)$ factor.
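
To make the k-mismatch problem concrete, here is a minimal Python sketch of indexed search. It is not the algorithm of the paper: it uses a plain, naively built suffix array rather than a compressed one, and it simply enumerates every string within Hamming distance k of P and looks each one up by binary search. The function names (build_suffix_array, hamming_neighbourhood, k_mismatch_search, naive_k_mismatch) are hypothetical, chosen for illustration, and the sketch assumes that every character of T belongs to the supplied alphabet. The number of candidate strings it generates is O(|A|^k m^k), which loosely mirrors the leading term of the query bounds quoted above.

```python
from bisect import bisect_left, bisect_right
from itertools import combinations, product


def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def hamming_neighbourhood(pattern, k, alphabet):
    """Yield each string within Hamming distance k of pattern exactly once."""
    letters_of = sorted(set(alphabet))
    m = len(pattern)
    for d in range(min(k, m) + 1):
        # Choose the d positions to change, then a different letter for each.
        for positions in combinations(range(m), d):
            choices = [[c for c in letters_of if c != pattern[p]]
                       for p in positions]
            for letters in product(*choices):
                chars = list(pattern)
                for p, c in zip(positions, letters):
                    chars[p] = c
                yield "".join(chars)


def k_mismatch_search(text, pattern, k, alphabet):
    """Return sorted start positions where text matches pattern with <= k mismatches.

    Assumption: every character of text occurs in alphabet.
    """
    m = len(pattern)
    sa = build_suffix_array(text)
    # Length-m prefixes of the sorted suffixes are themselves in sorted order,
    # so each exact lookup reduces to a binary search.
    prefixes = [text[i:i + m] for i in sa]
    hits = []
    for candidate in hamming_neighbourhood(pattern, k, alphabet):
        lo = bisect_left(prefixes, candidate)
        hi = bisect_right(prefixes, candidate)
        hits.extend(sa[lo:hi])
    return sorted(hits)


def naive_k_mismatch(text, pattern, k):
    """Reference answer by scanning every alignment of pattern against text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(text[i:i + m], pattern)) <= k]


if __name__ == "__main__":
    print(k_mismatch_search("abracadabra", "abda", 2, "abcdr"))  # [0, 4, 7]
```

Enumerating the whole neighbourhood is the simplest way to reuse an exact-match index for approximate queries, at the cost of work that grows exponentially in k. The k-difference variant would additionally have to account for insertions and deletions, which this sketch does not attempt.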