On biased reservoir sampling in the presence of stream evolution

Authors:
Charu C. Aggarwal
Affiliations:
IBM T. J. Watson Research Center, Hawthorne, NY
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

Abstract

Reservoir-based sampling is often used to pick an unbiased sample from a data stream. A large portion of such a sample may become less relevant over time as the stream evolves. An analytical or mining task (e.g., query estimation) that depends only on the sample points from a recent time horizon may give a very inaccurate result, because the number of relevant sample points shrinks along with the horizon. Yet this is precisely the most important case for data stream algorithms, since recent history is frequently analyzed. In such cases, we show that an effective solution is to bias the sample using temporal bias functions. Maintaining such a sample is non-trivial, since it must be updated dynamically without knowing the total number of points in advance. We prove some theoretical properties of a large class of memory-less bias functions, which allow for an efficient implementation of the sampling algorithm. We also show that including bias in the sampling process introduces a maximum requirement on the reservoir size. This property is useful because it shows that it may often be possible to maintain the maximum relevant sample with limited storage. We not only illustrate the advantages of the method for query estimation, but also show that the approach applies to broader data mining problems such as evolution analysis and classification.
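
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way a memory-less (exponential) temporal bias could be maintained in a single pass. The class name `BiasedReservoir`, the bias rate `lam`, and the choice of a capacity of roughly `1 / lam` are assumptions made for illustration. Each arriving point is always stored; it overwrites a random resident with probability equal to the fraction of the reservoir currently filled, and is appended otherwise, so older points are gradually displaced and the reservoir never exceeds its capacity.

```python
import random


class BiasedReservoir:
    """Illustrative biased reservoir with an exponential, memory-less bias.

    Assumptions (not taken from the abstract): `lam` is the bias rate and
    the reservoir capacity is set to roughly 1 / lam.
    """

    def __init__(self, lam, seed=None):
        if not 0 < lam <= 1:
            raise ValueError("lam must be in (0, 1]")
        self.capacity = max(1, round(1 / lam))
        self.items = []
        self.rng = random.Random(seed)

    def add(self, point):
        """Insert `point`; the total stream length need not be known."""
        fill = len(self.items) / self.capacity
        if self.rng.random() < fill:
            # Overwrite a uniformly chosen resident, which ages out old data.
            idx = self.rng.randrange(len(self.items))
            self.items[idx] = point
        else:
            # Reservoir is not yet full: grow it by one slot.
            self.items.append(point)


if __name__ == "__main__":
    reservoir = BiasedReservoir(lam=0.01, seed=7)
    for t in range(10_000):
        reservoir.add(t)
    print(len(reservoir.items), min(reservoir.items), max(reservoir.items))
```

Because every new point is stored on arrival, the most recent point is always present in the sample, and a query restricted to a short recent horizon sees far more sample points than it would under unbiased reservoir sampling of the same size.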