A metric index for approximate string matching

  • Authors: Gonzalo Navarro; Edgar Chávez
  • Affiliations: Center for Web Research, Department of Computer Science, University of Chile, Chile. Blanco Encalada, Santiago, Chile; Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana. Edificio "B", Ciudad Universitaria, Morelia, Mich. México
  • Venue: Theoretical Computer Science
  • Year: 2006

Abstract

We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This allows us to find the occ occurrences of a pattern of length $m$, permitting up to $r$ differences, in a text of length $n$ over an alphabet of size $\sigma$, in average time $O(m^{1+\varepsilon} + \mathit{occ})$ for any $\varepsilon > 0$, if $r = o(m/\log_\sigma m)$ and $m > ((1+\varepsilon)/\varepsilon)\log_\sigma n$. The index works well up to $r < m/\log_\sigma m$, where it achieves its maximum average search complexity $O(m^{1+\sqrt{2}+\varepsilon} + \mathit{occ})$. The construction time of the index is $O(m^{1+\sqrt{2}+\varepsilon}\, n \log n)$ and its space is $O(m^{1+\sqrt{2}+\varepsilon}\, n)$. This is the first index achieving average search time polynomial in $m$ and independent of $n$, for $r = O(m/\log_\sigma m)$. Previous methods achieve this complexity only for $r = O(m/\log_\sigma n)$. We also present a simpler scheme needing $O(n)$ space.
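
To make the idea of treating an approximate query as a proximity query more concrete, the following is a minimal Python sketch of generic pivot-based range searching under the edit distance. It is not the index described in the abstract. The sites here are simply the distinct substrings of a small text rather than suffix tree nodes, and the names `edit_distance`, `substrings`, `build_pivot_table`, and `range_query`, as well as the choice of pivots, are assumptions made for illustration. The sketch only shows how the triangle inequality lets a search discard sites without computing their distance to the query.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def substrings(text: str, lo: int, hi: int) -> List[str]:
    """Distinct substrings of text with length in [lo, hi] (illustrative sites)."""
    found = {
        text[i:i + k]
        for k in range(lo, hi + 1)
        for i in range(len(text) - k + 1)
    }
    return sorted(found)


def build_pivot_table(sites: List[str], pivots: List[str]) -> List[List[int]]:
    """Precompute the distance from every site to every pivot."""
    return [[edit_distance(s, p) for p in pivots] for s in sites]


def range_query(
    query: str,
    r: int,
    sites: List[str],
    pivots: List[str],
    table: List[List[int]],
) -> Tuple[List[str], int]:
    """Return the sites within distance r of query, plus how many were verified."""
    q_to_pivot = [edit_distance(query, p) for p in pivots]
    matches: List[str] = []
    verified = 0
    for site, row in zip(sites, table):
        # Triangle inequality: |d(q, p) - d(s, p)| <= d(q, s) for every pivot p,
        # so a gap larger than r proves that d(q, s) > r.
        if any(abs(dq - ds) > r for dq, ds in zip(q_to_pivot, row)):
            continue
        verified += 1
        if edit_distance(query, site) <= r:
            matches.append(site)
    return matches, verified


if __name__ == "__main__":
    sites = substrings("abracadabra", 3, 5)
    pivots = ["abra", "cada"]  # arbitrary pivots chosen for the example
    table = build_pivot_table(sites, pivots)
    matches, verified = range_query("abrc", 1, sites, pivots, table)
    print(matches, verified, len(sites))
```

Because the edit distance is a metric, the filter never discards a true match. It only reduces the number of sites that need a full distance computation, and how many it saves depends on the pivots chosen.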