Estimating alphanumeric selectivity in the presence of wildcards
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology
Algorithms on strings, trees, and sequences: computer science and computational biology
The string B-tree: a new data structure for string search in external memory and its applications
Journal of the ACM (JACM)
Substring selectivity estimation
PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
An analysis of the Burrows—Wheeler transform
Journal of the ACM (JACM)
Efficient Minimal Perfect Hashing in Nearly Minimal Space
STACS '01 Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
SIAM Journal on Computing
ACM Computing Surveys (CSUR)
Compressed representations of sequences and full-text indexes
ACM Transactions on Algorithms (TALG)
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Succinct indexes for strings, binary relations and multi-labeled trees
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Compressed text indexes: From theory to practice
Journal of Experimental Algorithmics (JEA)
Fast prefix search in little space, with applications
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part I
On the value of multiple read/write streams for data compression
Information Theory, Combinatorics, and Search Theory
Hi-index | 0.00 |
We study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet £ of size à is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form CountH(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases [15, 14, 5]. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing [8, 11, 6, 17] providing an optimal data structure that requires ?(|T|logÃ/l) bits where l e 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.