A metric index for approximate string matching

Authors:
Gonzalo Navarro; Edgar Chávez
Affiliations:
Center for Web Research, Department of Computer Science, University of Chile, Chile. Blanco Encalada, Santiago, Chile; Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana. Edificio "B", Ciudad Universitaria, Morelia, Mich. México
Venue:
Theoretical Computer Science
Year:
2006

Abstract

We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space whose sites are the nodes of the suffix tree of the text, and treat the approximate query as a proximity query on that metric space. This lets us find the $occ$ occurrences of a pattern of length $m$, permitting up to $r$ differences, in a text of length $n$ over an alphabet of size $\sigma$, in average time $O(m^{1+\varepsilon} + occ)$ for any $\varepsilon > 0$, if $r = o(m/\log_\sigma m)$ and $m > ((1+\varepsilon)/\varepsilon)\log_\sigma n$. The index works well up to $r = O(m/\log_\sigma m)$, where it achieves its maximum average search complexity $O(m^{1+\sqrt{2}+\varepsilon} + occ)$. The construction time of the index is $O(m^{1+\sqrt{2}+\varepsilon}\, n \log n)$ and its space is $O(m^{1+\sqrt{2}+\varepsilon}\, n)$. This is the first index achieving average search time polynomial in $m$ and independent of $n$, for $r = O(m/\log_\sigma m)$. Previous methods achieve this complexity only for $r = O(m/\log_\sigma n)$. We also present a simpler scheme needing $O(n)$ space.
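To illustrate the idea of answering an approximate query as a proximity query in a metric space, the following is a minimal Python sketch, not the index described in the abstract. It makes several simplifying assumptions: the sites are a plain list of distinct text substrings rather than suffix tree nodes, the pivots are picked arbitrarily, and the names `edit_distance`, `substrings`, and `PivotIndex` are hypothetical. It shows only how the triangle inequality of the edit distance lets a search discard candidates without computing their distance to the pattern.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def substrings(text: str, lo: int, hi: int) -> list:
    """Distinct substrings of text with length in [lo, hi], in sorted order."""
    lo = max(lo, 1)
    return sorted({text[i:i + k]
                   for k in range(lo, hi + 1)
                   for i in range(len(text) - k + 1)})


class PivotIndex:
    """Toy pivot table over a list of strings (the sites of the metric space).

    Assumption: this stands in for a real metric index; it is not the
    suffix-tree-based structure the abstract describes.
    """

    def __init__(self, sites, pivots):
        self.sites = list(sites)
        self.pivots = list(pivots)
        # Precompute d(site, pivot) for every site and every pivot.
        self.table = [[edit_distance(s, p) for p in self.pivots]
                      for s in self.sites]

    def range_query(self, pattern: str, r: int):
        """Return (sites within distance r of pattern, sites verified)."""
        dq = [edit_distance(pattern, p) for p in self.pivots]
        hits, verified = [], 0
        for site, row in zip(self.sites, self.table):
            # Triangle inequality: |d(q,p) - d(s,p)| <= d(q,s).
            # If the left side exceeds r for some pivot, s cannot match.
            if any(abs(dqp - dsp) > r for dqp, dsp in zip(dq, row)):
                continue
            verified += 1
            if edit_distance(pattern, site) <= r:
                hits.append(site)
        return hits, verified


if __name__ == "__main__":
    text, pattern, r = "abracadabra", "abra", 1
    m = len(pattern)
    # Any substring within r edits of the pattern has length in [m - r, m + r].
    sites = substrings(text, m - r, m + r)
    index = PivotIndex(sites, pivots=sites[:2])  # arbitrary pivot choice
    hits, verified = index.range_query(pattern, r)
    print(hits, verified, len(sites))
```

The filter never discards a true match, because it relies only on the triangle inequality, so the result equals a brute-force scan of all sites. How many sites survive the filter depends on the pivots and the data; this sketch makes no claim about the complexity bounds stated in the abstract.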