With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.
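The 1-D idea can be illustrated with a small sketch. The abstract does not spell out the data layout or the MO formula, so everything below is an assumption made for illustration: a plain dictionary stands in for the pruned count-suffix tree, counts are per-string presence counts, pruning uses a simple count threshold, and the estimate combines the longest retained pieces of the query while dividing out the probability of each overlap. The names `build_pruned_counts`, `mo_estimate`, `min_count`, and `unknown_prob` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def build_pruned_counts(strings, min_count):
    """Dictionary stand-in for a pruned count-suffix tree (assumption).

    Maps each substring to the number of input strings that contain it,
    keeping only substrings whose count is at least min_count.
    """
    counts = defaultdict(int)
    for s in strings:
        # Collect distinct substrings so each string is counted once.
        seen = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for sub in seen:
            counts[sub] += 1
    return {sub: c for sub, c in counts.items() if c >= min_count}


def mo_estimate(query, pst, n, unknown_prob=0.0):
    """Illustrative maximal-overlap style selectivity estimate.

    Walks the query left to right, takes the longest retained substring
    starting at each position, skips pieces already covered, and multiplies
    P(piece) / P(overlap with the previous piece). unknown_prob is a
    hypothetical fallback for characters with no retained substring.
    """
    est = 1.0
    prev_end = 0  # end index (exclusive) of the last piece used
    for i in range(len(query)):
        j = i
        while j < len(query) and query[i:j + 1] in pst:
            j += 1
        if j == i:
            # Nothing retained starts here: use the fallback probability.
            est *= unknown_prob
            prev_end = i + 1
            continue
        if j <= prev_end:
            continue  # fully covered by the previous piece
        p = pst[query[i:j]] / n
        if i < prev_end:
            # Condition on the part shared with the previous piece.
            p /= pst[query[i:prev_end]] / n
        est *= p
        prev_end = j
    return est


strings = ["abc", "abd", "bcd", "abc"]
pst = build_pruned_counts(strings, min_count=3)
estimate = mo_estimate("abc", pst, n=len(strings))
```

With this toy input and threshold, "abc" itself is pruned, so the estimate is assembled from the retained pieces "ab" and "bc", which overlap on "b". Raising or lowering `min_count` trades space for accuracy, which is the trade-off pruning is meant to expose.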