Estimating the approximate result size of a query before executing it, based on small summary statistics, is important for query optimization in database systems and for other facets of query processing. This also holds for queries over text databases. Research on selectivity estimation for such queries has focused on Boolean retrieval, where a document is either relevant to the query or not. With the convergence of database and information retrieval (IR) technology, however, selectivity estimation for more sophisticated relevance functions is gaining importance as well. Such functions generate a query-specific distribution of the documents over the interval [0, 1]. With document distributions, selectivity estimation means estimating how many documents have each degree of similarity to a given query. The problem is much more complex than selectivity estimation in the Boolean context: besides document frequency, query results also depend on other characteristics such as term frequencies and document lengths, and selectivity estimation must take these into account as well. This paper proposes and evaluates a technique for estimating the result of retrieval queries with non-Boolean relevance functions. It estimates discretized document distributions over the range of the relevance function. Despite this complexity, the technique requires little additional data compared to Boolean selectivity estimation, and that data can be stored in existing data structures with minor extensions. Our evaluation demonstrates the effectiveness of the technique.
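
To make the notion of a discretized document distribution concrete, the following minimal Python sketch computes the exact distribution for a toy collection, i.e., the quantity that a selectivity estimator would have to approximate from summary statistics. It is an illustration only and not the technique proposed in the paper: the `relevance` function (fraction of a document's tokens that are query terms), the `discretized_distribution` helper, the number of buckets, and the sample documents are all assumptions made for this example.

```python
def relevance(query_terms, doc_tokens):
    """Toy relevance score in [0, 1]: fraction of the document's tokens
    that are query terms. Hypothetical stand-in, not the paper's function.
    It depends on term frequency and document length, not just on whether
    a term occurs."""
    if not doc_tokens:
        return 0.0
    hits = sum(1 for token in doc_tokens if token in query_terms)
    return hits / len(doc_tokens)


def discretized_distribution(scores, num_buckets=4):
    """Count how many scores fall into each equal-width bucket of [0, 1]."""
    counts = [0] * num_buckets
    for score in scores:
        # A score of exactly 1.0 is placed in the last bucket.
        index = min(int(score * num_buckets), num_buckets - 1)
        counts[index] += 1
    return counts


# Assumed toy collection and query, for illustration only.
docs = [
    "database query optimization".split(),
    "query selectivity estimation for text query".split(),
    "wavelet histograms".split(),
    "text retrieval".split(),
]
query = {"query", "text"}

scores = [relevance(query, doc) for doc in docs]
distribution = discretized_distribution(scores, num_buckets=4)
print(distribution)
```

With these inputs the scores are 1/3, 1/2, 0, and 1/2, so the four buckets [0, 0.25), [0.25, 0.5), [0.5, 0.75), and [0.75, 1] contain 1, 1, 2, and 0 documents. A Boolean estimator would report only that three documents match; the distribution additionally says how strongly they match.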