New sampling-based summary statistics for improving approximate query answers

Authors:
Phillip B. Gibbons;Yossi Matias
Affiliations:
Information Sciences Research Center, Bell Laboratories;Department of Computer Science, Tel-Aviv University
Venue:
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Year:
1998

Abstract

In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries whenever possible. Before DBMSs that provide highly accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse.
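To make the idea of a concise sample more concrete, the following is a minimal sketch, not the paper's reference implementation. It assumes the usual formulation: values that occur more than once in the sample are stored as a value/count pair, a singleton costs one unit of space and a pair costs two, and each insert enters the sample with probability 1/τ for an entry threshold τ that is raised whenever the footprint bound is exceeded. The class name `ConciseSample`, its method names, and the threshold growth factor `growth` are illustrative assumptions.

```python
import random
from collections import Counter


class ConciseSample:
    """Illustrative concise sample with a bounded footprint (hypothetical API)."""

    def __init__(self, max_footprint, growth=1.1, rng=None):
        self.max_footprint = max_footprint  # space bound, in "words"
        self.growth = growth                # assumed factor for raising tau
        self.tau = 1.0                      # entry threshold: accept with prob. 1/tau
        self.counts = Counter()             # value -> number of sample points
        self.rng = rng or random.Random()

    def footprint(self):
        # A singleton costs 1 unit; a (value, count) pair costs 2 units.
        return sum(1 if c == 1 else 2 for c in self.counts.values())

    def sample_size(self):
        # Number of sample points represented, counting repeats.
        return sum(self.counts.values())

    def insert(self, value):
        # Admit the new value with probability 1/tau.
        if self.rng.random() < 1.0 / self.tau:
            self.counts[value] += 1
            # Raise the threshold until the sample fits in the bound again.
            while self.footprint() > self.max_footprint:
                self._raise_threshold()

    def _raise_threshold(self):
        new_tau = self.tau * self.growth
        keep = self.tau / new_tau  # each sample point survives with this prob.
        for value in list(self.counts):
            kept = sum(
                1 for _ in range(self.counts[value]) if self.rng.random() < keep
            )
            if kept:
                self.counts[value] = kept
            else:
                del self.counts[value]
        self.tau = new_tau
```

Under these assumptions, a heavily skewed input illustrates the advantage the abstract refers to: inserting the same value 1,000 times yields 1,000 sample points in a footprint of 2, whereas a traditional sample of the same footprint would hold only 2 points. A rough frequency estimate for a value would be its sample count multiplied by the current `tau`.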