Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Elements of statistical computing
Elements of statistical computing
Randomized algorithms
Approximation algorithms for NP-hard problems
Approximation algorithms for NP-hard problems
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Quasi-cubes: exploiting approximations in multidimensional databases
ACM SIGMOD Record
AutoAdmin “what-if” index analysis utility
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
An application of mathematical programming to sample allocation
Computational Statistics & Data Analysis
Data cube approximation and histograms via wavelets
Proceedings of the seventh international conference on Information and knowledge management
Approximate computation of multidimensional aggregates of sparse data using wavelets
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Using approximations to scale exploratory data analysis in datacubes
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Approximating multi-dimensional aggregate range queries over real attributes
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Congressional samples for approximate answering of group-by queries
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Space-efficient online computation of quantile summaries
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
A robust, optimization-based approach for approximate answering of aggregate queries
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Optimizing queries using materialized views: a practical, scalable solution
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Selectivity estimation using probabilistic models
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Machine Learning
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Data Mining and Knowledge Discovery
Optimizing Queries with Materialized Views
ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
Overcoming Limitations of Sampling for Aggregation Queries
Proceedings of the 17th International Conference on Data Engineering
Histogram-Based Approximation of Set-Valued Query-Answers
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Automated Selection of Materialized Views and Indexes in SQL Databases
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Approximate Query Processing Using Wavelets
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
ICICLES: Self-Tuning Samples for Approximate Query Answering
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes
Proceedings of the 27th International Conference on Very Large Data Bases
An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Fast Approximate Answers to Aggregate Queries on a Data Cube
SSDBM '99 Proceedings of the 11th International Conference on Scientific and Statistical Database Management
Answering queries using views: A survey
The VLDB Journal — The International Journal on Very Large Data Bases
Dynamic sample selection for approximate query processing
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Gossip-Based Computation of Aggregate Information
FOCS '03 Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data
IEEE Transactions on Knowledge and Data Engineering
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Online maintenance of very large random samples
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Integrating vertical and horizontal partitioning into automated physical database design
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Join-distinct aggregate estimation over update streams
Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Robust estimation with sampling and approximate pre-aggregation
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Data stream query processing: a tutorial
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Toward best-effort information extraction
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
LCS-Hist: taming massive high-dimensional data cube compression
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Journal of Data and Information Quality (JDIQ)
Statistical structures for Internet-scale data management
The VLDB Journal — The International Journal on Very Large Data Bases
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Journal of Intelligent Information Systems
Online stratified sampling: evaluating classifiers at web-scale
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Stratified reservoir sampling over heterogeneous data streams
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Towards approximate SQL: infobright's approach
RSCTC'10 Proceedings of the 7th international conference on Rough sets and current trends in computing
Effective and efficient sampling methods for deep web aggregation queries
Proceedings of the 14th International Conference on Extending Database Technology
Just-in-time analytics on large file systems
FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
The VC-dimension of SQL queries and selectivity estimation through sampling
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Effective stratification for low selectivity queries on deep web data sources
Proceedings of the 20th ACM international conference on Information and knowledge management
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Sample-based forecasting exploiting hierarchical time series
Proceedings of the 16th International Database Engineering & Applications Sysmposium
Self-adaptive approximate queries for large-scale information aggregation
International Journal of Web and Grid Services
CS2: a new database synopsis for query estimation
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
BlinkDB: queries with bounded errors and bounded response times on very large data
Proceedings of the 8th ACM European Conference on Computer Systems
Adaptive stratified reservoir sampling over heterogeneous data streams
Information Systems
Hi-index | 0.00 |
The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem where, given a workload of queries, we select a stratified random sample of the original data such that the error in answering the workload queries using the sample is minimized. A key novelty of our approach is that we can tailor the choice of samples to be robust, even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server that demonstrate the superior quality of our method compared to previous work.