Estimating the number of substring matches in long string databases

Authors:
Jinuk Bae;Sukho Lee
Affiliations:
School of Electrical Engineering and Computer Science, Seoul National University, Korea;School of Electrical Engineering and Computer Science, Seoul National University, Korea
Venue:
APWeb'05 Proceedings of the 7th Asia-Pacific web conference on Web Technologies Research and Development
Year:
2005

Abstract

Estimating the number of substring matches is one of the problems of alphanumeric selectivity estimation, which relies on statistical information about strings. In this context, the CS-tree (Count Suffix Tree), a variation of the suffix tree, has been used as the basic data structure for storing statistical information about substrings. However, although the CS-tree is useful for keeping information about short strings such as names or titles, it has two drawbacks: some of the count values it keeps can be incorrect, and it is almost impossible to build over long strings such as biological sequences. Therefore, for estimating the number of substring matches in long strings, we propose the CQ-tree (Count Q-gram Tree), which keeps the exact count values of all substrings of length q or less in the long strings and can be constructed in one scan of the data strings. Furthermore, on the basis of the CQ-tree, we return lower and upper bounds on the number of occurrences of a query pattern, together with its estimated count. These bounds are mathematically proven. To the best of our knowledge, ours is the first work on alphanumeric selectivity estimation to present lower and upper bounds.
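
The idea of keeping exact counts for every substring of length at most q, and deriving bounds from them, can be illustrated with a minimal Python sketch. This is not the paper's CQ-tree or its proven bounds: a flat dictionary stands in for the tree, and the function names `build_counts` and `estimate_with_bounds` are hypothetical. The upper bound used here is the smallest count among the pattern's q-grams, the lower bound comes from inclusion-exclusion over overlapping q-grams, and the estimate is a Markov-style chain of count ratios. All three are assumptions chosen for illustration.

```python
from collections import Counter


def build_counts(strings, q):
    """Count every substring of length 1..q in one left-to-right scan.

    A flat Counter is used here as a stand-in for a tree structure.
    """
    counts = Counter()
    for s in strings:
        for i in range(len(s)):
            for n in range(1, min(q, len(s) - i) + 1):
                counts[s[i:i + n]] += 1
    return counts


def estimate_with_bounds(counts, q, pattern):
    """Return (lower, estimate, upper) for the occurrence count of pattern.

    Illustrative bounds only (assumed, not taken from the paper):
      upper: no pattern can occur more often than any of its q-grams
      lower: count(xyz) >= count(xy) + count(yz) - count(y)
    """
    if q < 2:
        raise ValueError("this sketch needs q >= 2")
    if len(pattern) <= q:
        exact = counts[pattern]  # stored exactly, so the bounds coincide
        return exact, float(exact), exact

    first = pattern[:q]
    lower = upper = counts[first]
    estimate = float(counts[first])
    # Extend the pattern one character at a time using overlapping q-grams.
    for k in range(1, len(pattern) - q + 1):
        gram = pattern[k:k + q]
        overlap = gram[:-1]  # the q-1 characters shared with the prefix
        c_gram, c_overlap = counts[gram], counts[overlap]
        lower = max(0, lower + c_gram - c_overlap)
        upper = min(upper, c_gram)
        estimate = estimate * c_gram / c_overlap if c_overlap else 0.0
    # Keep the point estimate inside the valid interval.
    estimate = min(max(estimate, lower), upper)
    return lower, estimate, upper


if __name__ == "__main__":
    counts = build_counts(["abcabcabd"], q=2)
    print(estimate_with_bounds(counts, 2, "abc"))
```

For patterns no longer than q the stored count is exact, so the bounds coincide. For longer patterns the interval can be wide: with the strings "ab" and "bc" and q = 2, the pattern "abc" gets a lower bound of 0 and an upper bound of 1, while its true count is 0.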