A fast bit-vector algorithm for approximate string matching based on dynamic programming

Authors:
Gene Myers
Affiliations:
Univ. of Arizona, Tucson
Venue:
Journal of the ACM (JACM)
Year:
1999

Abstract

The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time, where w is the word size of the machine (e.g., 32 or 64 in practice) and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. (1996). This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, which computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, which computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior to or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
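
The idea of encoding a dynamic programming column as bit-vectors can be illustrated with a minimal Python sketch. It is not the paper's code: the function names `dp_search` and `myers_search` are hypothetical, the update rules follow the commonly cited formulation of the column step, and Python's unbounded integers stand in for a single machine word, so the m <= w restriction and the block-based extension for large m are not modelled. The plain column-by-column d.p. is included only as a baseline for comparison.

```python
def dp_search(pattern, text, k):
    """Baseline O(nm) d.p.: end positions j where some substring of
    text ending at j matches pattern with at most k differences."""
    m = len(pattern)
    col = list(range(m + 1))          # column for the empty text prefix
    ends = []
    for j, c in enumerate(text):
        diag = col[0]                 # D[i-1][j-1], starting at row 0
        col[0] = 0                    # a match may start anywhere
        for i in range(1, m + 1):
            cur = min(diag + (pattern[i - 1] != c),  # substitute / match
                      col[i] + 1,                    # D[i][j-1] + 1
                      col[i - 1] + 1)                # D[i-1][j] + 1
            diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


def myers_search(pattern, text, k):
    """Bit-vector sketch: the column is stored as vertical deltas
    (pv = +1 bits, mv = -1 bits) and updated with a constant number
    of word operations per text character."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1               # emulates an m-bit word (assumption)
    high = 1 << (m - 1)               # bit for the last pattern row
    peq = {}                          # per-character match masks
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv, score = mask, 0, m        # initial column is 0, 1, ..., m
    ends = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = mv | (~(xh | pv) & mask)  # horizontal +1 deltas
        mh = pv & xh                   # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        if score <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(myers_search("abc", "xabcx", 0))   # [3]
    print(myers_search("abc", "xabcx", 1))   # [2, 3, 4]
```

Both functions report 0-based end positions of matches. The inner loop of `myers_search` does not mention k at all, which is the property the abstract highlights: k only enters in the final comparison against the running score.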