Succinct dictionary matching with no slowdown

  • Authors:
  • Djamal Belazzougui

  • Affiliations:
  • LIAFA, Univ. Paris Diderot-Paris 7, Paris Cedex 13, France

  • Venue:
  • CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
  • Year:
  • 2010

Abstract

Dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over an alphabet of size σ (not necessarily constant), build a data structure that can report all occurrences of strings belonging to S in any text T. The classical solution is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a representation that occupies O(m log m) bits of space, where m ≤ n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log σ + O(1)) + O(d log(n/d)) bits of space while still answering queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses O(n log σ) bits of space while answering queries in O(|T| log log n + occ) time. In the paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)) bits, where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that σ < m^ε for any constant 0 < ε < 1.
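
For orientation, the classical baseline that the abstract refers to can be sketched as follows. This is a minimal, pointer-based Aho-Corasick automaton in Python, not the succinct representation proposed in the paper; the function names `build_automaton` and `find_all` are illustrative assumptions, and the output lists are copied along failure links for simplicity rather than stored compactly.

```python
from collections import deque


def build_automaton(patterns):
    """Build a plain (non-succinct) Aho-Corasick automaton for `patterns`.

    Assumes every pattern is a non-empty string.
    """
    goto = [{}]  # goto[s][c]: child of state s on character c (trie edges)
    fail = [0]   # fail[s]: state for the longest proper suffix of s in the trie
    out = [[]]   # out[s]: indices of all patterns that end at state s

    # Phase 1: insert every pattern into the trie; each trie node is a state.
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)

    # Phase 2: breadth-first traversal to compute failure links.
    # States at depth 1 keep the root (state 0) as their failure target.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Simplification: copy the matches reachable via the failure link.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_position, pattern) for every occurrence in `text`."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for pos, ch in enumerate(text):
        # Follow failure links until a transition on ch exists or we hit the root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            matches.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return matches


if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

With the dictionary {he, she, his, hers} (d = 4, n = 12), the trie has m = 10 states, consistent with m ≤ n + 1. Storing each transition and failure link as a machine pointer is what costs O(m log m) bits in the classical representation; the paper's contribution is to reduce this while keeping the same query time.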