Physical database design for relational databases
ACM Transactions on Database Systems (TODS)
Processing aggregate relational queries with hard time constraints
SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
Practical selectivity estimation through adaptive sampling
SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
On estimating the size of projections
ICDT '90 Proceedings of the third international conference on database theory on Database theory
Error-constrained COUNT query evaluation in relational databases
SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
Sequential sampling procedures for query size estimation
SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Efficient sampling strategies for relational database operations
ICDT Selected papers of the 4th international conference on Database theory
Randomized algorithms
Balancing histogram optimality and practicality for query result size estimation
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Improved histograms for selectivity estimation of range predicates
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Query size estimation by adaptive sampling (extended abstract)
PODS '90 Proceedings of the ninth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Statistical estimators for relational algebra expressions
Proceedings of the seventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Access path selection in a relational database management system
SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Accurate estimation of the number of tuples satisfying a condition
SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
An Evaluation of Sampling-Based Size Estimation Methods for Selections in Database Systems
ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
Sampling-Based Estimation of the Number of Distinct Values of an Attribute
VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
An overview of query optimization in relational systems
PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
AutoAdmin “what-if” index analysis utility
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
A comparison of selectivity estimators for range queries on metric attributes
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Synopsis data structures for massive data sets
Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
Towards estimation error guarantees for distinct values
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Approximating multi-dimensional aggregate range queries over real attributes
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Sampling from databases using B+-trees
Proceedings of the ninth international conference on Information and knowledge management
Optimal and approximate computation of summary statistics for range aggregates
PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Space-efficient online computation of quantile summaries
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Support vector machine active learning for image retrieval
MULTIMEDIA '01 Proceedings of the ninth ACM international conference on Multimedia
New directions in traffic measurement and accounting
IMW '01 Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement
Models and issues in data stream systems
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Fast incremental maintenance of approximate histograms
ACM Transactions on Database Systems (TODS)
On Issues of Instance Selection
Data Mining and Knowledge Discovery
Automating Statistics Management for Query Optimizers
IEEE Transactions on Knowledge and Data Engineering
Probabilistic Optimization of Top N Queries
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Online Dynamic Reordering for Interactive Data Processing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-size Estimation
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes
Proceedings of the 27th International Conference on Very Large Data Bases
Frequency Estimation of Internet Packet Streams with Limited Space
ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
The VLDB Journal — The International Journal on Very Large Data Bases
Comparing Data Streams Using Hamming Norms (How to Zero In)
IEEE Transactions on Knowledge and Data Engineering
Handbook of data mining and knowledge discovery
A Pareto model for OLAP view size estimation
CASCON '01 Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets
IEEE Transactions on Knowledge and Data Engineering
A Selectivity Model for Fragmented Relations: Applied in Information Retrieval
IEEE Transactions on Knowledge and Data Engineering
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Query sampling in DB2 Universal Database
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Approximately uniform random sampling in sensor networks
DMSN '04 Proceeedings of the 1st international workshop on Data management for sensor networks: in conjunction with VLDB 2004
Maintaining Implicated Statistics in Constrained Environments
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Selectivity estimators for multidimensional range queries over real attributes
The VLDB Journal — The International Journal on Very Large Data Bases
A robust system for accurate real-time summaries of internet traffic
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Fast and approximate stream mining of quantiles and frequencies using graphics processors
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling
VLDB '05 Proceedings of the 31st international conference on Very large data bases
The space complexity of pass-efficient algorithms for clustering
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
Synopses for query optimization: A space-complexity perspective
ACM Transactions on Database Systems (TODS) - Special Issue: SIGMOD/PODS 2004
Approximate quantiles and the order of the stream
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
To search or to crawl?: towards a query optimizer for text-centric tasks
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Data streams: algorithms and applications
Foundations and Trends® in Theoretical Computer Science
High-throughput sketch update on a low-power stream processor
Proceedings of the 2006 ACM/IEEE symposium on Architecture for networking and communications systems
Physical Database Design: the database professional's guide to exploiting indexes, views, storage, and more
Cardinality estimation using sample views with quality assurance
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
The power of slicing in internet flow measurement
IMC '05 Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement
Towards higher disk head utilization: extracting free bandwidth from busy disk drives
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
IEEE Transactions on Visualization and Computer Graphics
Efficient Approximate Query Processing in Peer-to-Peer Networks
IEEE Transactions on Knowledge and Data Engineering
Comparing data streams using Hamming norms (how to zero in)
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations
Computational Linguistics
Towards a query optimizer for text-centric tasks
ACM Transactions on Database Systems (TODS)
Sampling from databases using B$^+$-Trees
Intelligent Data Analysis
The history of histograms (abridged)
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Bloom histogram: path selectivity estimation for XML data with updates
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Self-tuning database systems: a decade of progress
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Distinct value estimation on peer-to-peer networks
Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments
A survey of top-k query processing techniques in relational database systems
ACM Computing Surveys (CSUR)
ACM Transactions on Computer Systems (TOCS)
A quality-aware optimizer for information extraction
ACM Transactions on Database Systems (TODS)
A divide-and-conquer recursive approach for scaling up instance selection algorithms
Data Mining and Knowledge Discovery
by chance enhancing interaction with large data sets through statistical sampling
Proceedings of the Working Conference on Advanced Visual Interfaces
Statistical structures for Internet-scale data management
The VLDB Journal — The International Journal on Very Large Data Bases
Building data synopses within a known maximum error bound
APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
Result-size estimation for information-retrieval subqueries
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A sample advisor for approximate query processing
ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
Data integration with dependent sources
Proceedings of the 14th International Conference on Extending Database Technology
Real-time approximate Range Motif discovery & data redundancy removal algorithm
Proceedings of the 14th International Conference on Extending Database Technology
Optimizing data partitioning for data-parallel computing
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
The VC-dimension of SQL queries and selectivity estimation through sampling
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Intelligent statistics management in sybase ASE 15.0
DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications
HASE: a hybrid approach to selectivity estimation for conjunctive predicates
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
The pipelined set cover problem
ICDT'05 Proceedings of the 10th international conference on Database Theory
A scalable supervised algorithm for dimensionality reduction on streaming data
Information Sciences: an International Journal
Processing count queries over event streams at multiple time granularities
Information Sciences: an International Journal
Approximating and testing k-histogram distributions in sub-linear time
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
HEDC: a histogram estimator for data in the cloud
Proceedings of the fourth international workshop on Cloud data management
Efficiently adapting graphical models for selectivity estimation
The VLDB Journal — The International Journal on Very Large Data Bases
Optimus: a dynamic rewriting framework for data-parallel execution plans
Proceedings of the 8th ACM European Conference on Computer Systems
Manipulation of Training Sets for Improving Data Mining Coverage-Driven Verification
Journal of Electronic Testing: Theory and Applications
Indexing for summary queries: Theory and practice
ACM Transactions on Database Systems (TODS)
Estimating duplication by content-based sampling
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
PREDIcT: towards predicting the runtime of large scale iterative analytics
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0 and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions.