Space-efficient substring occurrence estimation

  • Authors:
  • Alessio Orlandi; Rossano Venturini

  • Affiliations:
  • University of Pisa, Pisa, Italy; ISTI-CNR, Pisa, Italy

  • Venue:
  • Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
  • Year:
  • 2011

Abstract

We study the problem of estimating the number of occurrences of substrings in textual data: a text T on some alphabet Σ of size σ is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form Count≈(P), returning an approximate number of occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases [15, 14, 5]. Our focus is on obtaining an algorithmic solution with guaranteed error rates and a small footprint. To achieve that, we first enrich previous work in the area of compressed text indexing [8, 11, 6, 17], providing an optimal data structure that requires Θ(|T| log σ / l) bits, where l ≥ 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the number of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.
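
To make the query semantics concrete, the following is a minimal Python sketch of an approximate counting query with a bounded additive error. It is not the data structure proposed in the paper, and it makes no attempt at space efficiency: it is a brute-force illustration that stores exact counts only for substrings occurring at least l times and answers 0 for everything else, so every answer is off by less than l and frequent patterns are answered exactly. The function names (exact_count, build_pruned_table, approx_count) and the sample text are assumptions introduced for this example.

```python
from collections import Counter
from typing import Dict


def exact_count(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))


def build_pruned_table(text: str, l: int) -> Dict[str, int]:
    """Hypothetical 'index': exact counts of substrings occurring >= l times.

    Enumerates all O(|T|^2) substrings, so this is only suitable for tiny
    inputs; it illustrates the error guarantee, not the space bound.
    """
    n = len(text)
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    return {s: c for s, c in counts.items() if c >= l}


def approx_count(table: Dict[str, int], pattern: str) -> int:
    """Answer a counting query from the table alone, without the text.

    Patterns missing from the table occur fewer than l times, so
    returning 0 keeps the additive error below l.
    """
    return table.get(pattern, 0)


text = "abracadabra"
l = 2
table = build_pruned_table(text, l)

for pattern in ["a", "abra", "c", "xyz"]:
    print(pattern, exact_count(text, pattern), approx_count(table, pattern))
```

In this sketch, a pattern such as "abra" (two occurrences) is answered exactly because it meets the threshold l = 2, while "c" (one occurrence) is answered with 0, an error of 1, which is below l.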