Space-efficient substring occurrence estimation

Authors:
Alessio Orlandi; Rossano Venturini
Affiliations:
University of Pisa, Pisa, Italy; ISTI-CNR, Pisa, Italy
Venue:
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2011

Citing 16

Estimating alphanumeric selectivity in the presence of wildcards. SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology
The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM (JACM)
Substring selectivity estimation. PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Managing gigabytes (2nd ed.): compressing and indexing documents and images
An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM)
Efficient Minimal Perfect Hashing in Nearly Minimal Space. STACS '01 Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem. ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Indexing compressed text. Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing
Compressed full-text indexes. ACM Computing Surveys (CSUR)
Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG)
Compressed permuterm index. SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Succinct indexes for strings, binary relations and multi-labeled trees. SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Compressed text indexes: From theory to practice. Journal of Experimental Algorithmics (JEA)
Fast prefix search in little space, with applications. ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part I

Cited 1

On the value of multiple read/write streams for data compression. Information Theory, Combinatorics, and Search Theory

Abstract

We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Σ of size σ is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form Count≈(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases [15, 14, 5]. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing [8, 11, 6, 17] providing an optimal data structure that requires Θ(|T| log σ / l) bits where l ≥ 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.
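
The query contract described in the abstract (an answer that differs from the true occurrence count by less than the additive error l) can be illustrated with a toy sketch. The Python below is not the paper's data structure: it stores every l-th suffix of the sorted suffix list explicitly, so it comes nowhere near the stated space bound. It only shows why sampling at rate l keeps the error below l. The names `exact_count`, `SampledSuffixEstimator`, and `count_approx` are hypothetical and chosen for this illustration.

```python
def exact_count(text: str, pattern: str) -> int:
    """Brute-force count of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    return sum(
        text.startswith(pattern, i)
        for i in range(len(text) - len(pattern) + 1)
    )


class SampledSuffixEstimator:
    """Toy estimator (an illustration, not the paper's index).

    Keeps every l-th suffix in lexicographic order. The suffixes that start
    with a pattern form one contiguous run in sorted order, so the number of
    kept samples inside that run, multiplied by l, differs from the true
    count by at most l - 1.
    """

    def __init__(self, text: str, l: int) -> None:
        if l < 1:
            raise ValueError("l must be at least 1")
        self.l = l
        # Naive construction: materialise and sort all suffixes, keep ranks 0, l, 2l, ...
        suffixes = sorted(text[i:] for i in range(len(text)))
        self.samples = suffixes[::l]

    def count_approx(self, pattern: str) -> int:
        """Return an estimate within additive error l - 1 of the true count."""
        if not pattern:
            raise ValueError("pattern must be non-empty")
        # Linear scan for clarity; a binary search over the sorted samples would also work.
        hits = sum(1 for s in self.samples if s.startswith(pattern))
        return hits * self.l


if __name__ == "__main__":
    text = "abracadabra" * 4
    estimator = SampledSuffixEstimator(text, l=4)
    for pattern in ("a", "abra", "cad", "zzz"):
        print(pattern, exact_count(text, pattern), estimator.count_approx(pattern))
```

With l = 1 every suffix is kept and the estimate is exact; larger values of l trade accuracy for fewer stored samples, which mirrors the size/error trade-off the abstract describes.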