Congressional samples for approximate answering of group-by queries

Authors:
Swarup Acharya;Phillip B. Gibbons;Viswanath Poosala
Affiliations:
Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Murray Hill NJ;Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Murray Hill NJ;Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Murray Hill NJ
Venue:
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Year:
2000

Citing 12
Cited 58

Improved histograms for selectivity estimation of range predicates

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
An overview of data warehousing and OLAP technology

ACM SIGMOD Record
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Approximate computation of multidimensional aggregates of sparse data using wavelets

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Ripple joins for online aggregation

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Access path selection in a relational database management system

SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Aqua: A Fast Decision Support Systems Using Approximate Query Answers

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Histogram-Based Approximation of Set-Valued Query-Answers

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Including Group-By in Query Optimization

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases

Tracking join and self-join sizes in limited storage

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A robust, optimization-based approach for approximate answering of aggregate queries

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Models and issues in data stream systems

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient aggregation over objects with extent

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Dwarf: shrinking the PetaCube

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Compressing SQL workloads

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Fast incremental maintenance of approximate histograms

ACM Transactions on Database Systems (TODS)
Approximate Query Answering Using Data Warehouse Striping

Journal of Intelligent Information Systems - Special issue on data warehousing and knowledge discovery
Continuous queries over data streams

ACM SIGMOD Record
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes

Proceedings of the 27th International Conference on Very Large Data Bases
On Linear-Spline Based Histograms

WAIM '02 Proceedings of the Third International Conference on Advances in Web-Age Information Management
Approximate Query Answering Using Data Warehouse Striping

DaWaK '01 Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery
Time-Interval Sampling for Improved Estimations in Data Warehouses

DaWaK 2000 Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery
A new two-phase sampling based algorithm for discovering association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Dynamic sample selection for approximate query processing

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
TiNA: a scheme for temporal coherency-aware in-network aggregation

Proceedings of the 3rd ACM international workshop on Data engineering for wireless and mobile access
Hierarchical dwarfs for the rollup cube

DOLAP '03 Proceedings of the 6th ACM international workshop on Data warehousing and OLAP
DSQoS-distributed architecture providing QoS in summary warehouses

DOLAP '03 Proceedings of the 6th ACM international workshop on Data warehousing and OLAP
Online maintenance of very large random samples

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Query sampling in DB2 Universal Database

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Balancing energy efficiency and quality of aggregate data in sensor networks

The VLDB Journal — The International Journal on Very Large Data Bases
Venn Sampling: A Novel Prediction Technique for Moving Objects

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Sample-Based Quality Estimation of Query Results in Relational Database Environments

IEEE Transactions on Knowledge and Data Engineering
Derby/S: a DBMS for sample-based query answering

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Measuring Data Abstraction Quality in Multiresolution Visualizations

IEEE Transactions on Visualization and Computer Graphics
Random Sampling for Continuous Streams with Arbitrary Updates

IEEE Transactions on Knowledge and Data Engineering
Error minimization in approximate range aggregates

Data & Knowledge Engineering
Optimized stratified sampling for approximate query processing

ACM Transactions on Database Systems (TODS)
ROLAP implementations of the data cube

ACM Computing Surveys (CSUR)
Estimating the output cardinality of partial preaggregation with a measure of clusteredness

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Primitives for workload summarization and implications for SQL

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Robust estimation with sampling and approximate pre-aggregation

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
The polynomial complexity of fully materialized coalesced cubes

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Supporting time-constrained SQL queries in oracle

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Proactive and reactive multi-dimensional histogram maintenance for selectivity estimation

Journal of Systems and Software
Confidence bounds for sampling-based group by estimates

ACM Transactions on Database Systems (TODS)
Maintaining very large random samples using the geometric file

The VLDB Journal — The International Journal on Very Large Data Bases
A survey of top-k query processing techniques in relational database systems

ACM Computing Surveys (CSUR)
SNQL: a query language for sensor network databases

TELE-INFO'08 Proceedings of the 7th WSEAS International Conference on Telecommunications and Informatics
Linked Bernoulli Synopses: Sampling along Foreign Keys

SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Sample synopses for approximate answering of group-by queries

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Turbo-charging estimate convergence in DBO

Proceedings of the VLDB Endowment
Revisiting the cube lifecycle in the presence of hierarchies

The VLDB Journal — The International Journal on Very Large Data Bases
Sampling dirty data for matching attributes

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Continuous sampling for online aggregation over multiple queries

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Stratified reservoir sampling over heterogeneous data streams

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
A sample advisor for approximate query processing

ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
A comparison between approximate counting and sampling methods for frequent pattern mining on data streams

Intelligent Data Analysis
Discovery of frequent patterns in transactional data streams

Transactions on large-scale data- and knowledge-centered systems II
Deferred maintenance of disk-based random samples

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Hierarchical group-based sampling

BNCOD'05 Proceedings of the 22nd British National conference on Databases: enterprise, Skills and Innovation
Approximate answers to OLAP queries on streaming data warehouses

Proceedings of the fifteenth international workshop on Data warehousing and OLAP
A clustered Dwarf structure to speed up queries on data cubes

DaWaK'07 Proceedings of the 9th international conference on Data Warehousing and Knowledge Discovery
BlinkDB: queries with bounded errors and bounded response times on very large data

Proceedings of the 8th ACM European Conference on Computer Systems
Adaptive stratified reservoir sampling over heterogeneous data streams

Information Systems
Optimizing Sample Design for Approximate Query Processing

International Journal of Knowledge-Based Organizations

Abstract

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex decision support queries using precomputed summary statistics, such as samples. Decision support queries routinely segment the data into groups and then aggregate the information in each group (group-by queries). Depending on the data, there can be a wide disparity in the number of data items per group. As a result, approximate answers based on uniform random samples of the data can be inaccurate for groups with very few data items, since such groups will be represented in the sample by very few (often zero) tuples.

In this paper, we propose a general class of techniques for obtaining fast, highly accurate answers for group-by queries. These techniques rely on precomputed non-uniform (biased) samples of the data. In particular, we propose congressional samples, a hybrid union of uniform and biased samples. Given a fixed amount of space, congressional samples seek to maximize the accuracy for all possible group-by queries on a set of columns. We present a one-pass algorithm for constructing a congressional sample and use the same technique to keep the sample up to date incrementally, without accessing the base relation. We also evaluate query rewriting strategies for providing approximate answers from congressional samples. Finally, we conduct an extensive set of experiments on the TPC-D database, which demonstrate the efficacy of the proposed techniques.
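
To illustrate why a hybrid of uniform and biased sampling helps small groups, the following is a minimal sketch, not the paper's algorithm. It assumes a single grouping column, a fixed sample budget, and one simple way of combining the two allocations (take the larger share per group, then rescale to fit the budget). The function name `allocate`, the group sizes, and the budget are all hypothetical.

```python
def allocate(group_sizes, budget):
    """Split a sample budget across groups under three strategies.

    Assumption: a simplified illustration of mixing uniform and biased
    allocations; the paper's exact construction may differ.
    """
    total = sum(group_sizes.values())
    n_groups = len(group_sizes)

    # Uniform random sample: expected share is proportional to group size.
    proportional = {g: budget * n / total for g, n in group_sizes.items()}

    # Fully biased sample: equal share per group, capped at the group size.
    equal = {g: min(n, budget / n_groups) for g, n in group_sizes.items()}

    # Hybrid: take the larger share for each group, then scale everything
    # down so the allocations again sum to the budget.
    raw = {g: max(proportional[g], equal[g]) for g in group_sizes}
    scale = budget / sum(raw.values())
    hybrid = {g: raw[g] * scale for g in group_sizes}
    return proportional, equal, hybrid


# Hypothetical, heavily skewed group sizes and a budget of 300 tuples.
group_sizes = {"A": 9000, "B": 900, "C": 100}
proportional, equal, hybrid = allocate(group_sizes, budget=300)

# When answering a query from the sample, each sampled tuple of group g
# stands for (group size / sample size) base tuples.
scale_factors = {g: group_sizes[g] / hybrid[g] for g in group_sizes}

for g in group_sizes:
    print(g, round(proportional[g], 1), round(equal[g], 1), round(hybrid[g], 1))
```

In this sketch the smallest group receives far more sample tuples under the hybrid allocation than under the proportional one, while the largest group still receives more than an equal split would give it. The per-group scale factors show the kind of weighting a rewritten aggregate query would need in order to produce unbiased estimates from a non-uniform sample.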