Non-uniformity issues and workarounds in bounded-size sampling

Authors:
Rainer Gemulla;Peter J. Haas;Wolfgang Lehner
Affiliations:
Max-Planck-Institut für Informatik, Saarbrücken, Germany;IBM Almaden Research Center, San Jose, USA;Technische Universität Dresden, Dresden, Germany
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2013

Abstract

A variety of schemes have been proposed in the literature to speed up query processing and analytics by incrementally maintaining a bounded-size uniform sample from a dataset in the presence of a sequence of insertion, deletion, and update transactions. These algorithms vary according to whether the dataset is an ordinary set or a multiset and whether the transaction sequence consists only of insertions or can include deletions and updates. We report on subtle non-uniformity issues that we found in a number of these prior bounded-size sampling schemes, including some of our own. We provide workarounds that can avoid the non-uniformity problem; these workarounds are easy to implement and incur negligible additional cost. We also consider the impact of non-uniformity in practice and describe simple statistical tests that can help detect non-uniformity in new algorithms.
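
The idea of testing a sampling scheme for uniformity can be illustrated with a small sketch. The abstract does not say which statistical tests the authors use, so the code below swaps in a standard chi-square goodness-of-fit statistic over all possible samples, and it covers only the simplest setting (an ordinary set with insertions only, maintained by classic reservoir sampling). The function names, the deliberately broken variant, and the parameters n, k, and trials are all illustrative assumptions and do not come from the paper.

```python
import itertools
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Classic reservoir sampling (Algorithm R) for an insertion-only stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive bounds: 0..i
            if j < k:
                sample[j] = item       # accept with probability k / (i + 1)
    return sample


def biased_sample(stream, k, rng):
    """Hypothetical broken variant: an off-by-one in the random index range."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i - 1)  # bug: new items are accepted too often
            if j < k:
                sample[j] = item
    return sample


def chi_square_statistic(sampler, n, k, trials, seed=0):
    """Chi-square statistic of observed size-k samples against the uniform law.

    A uniform scheme should produce each of the C(n, k) subsets equally often,
    so the expected count per subset is trials / C(n, k).
    """
    rng = random.Random(seed)
    subsets = list(itertools.combinations(range(n), k))
    counts = Counter()
    for _ in range(trials):
        counts[tuple(sorted(sampler(range(n), k, rng)))] += 1
    expected = trials / len(subsets)
    return sum((counts[s] - expected) ** 2 / expected for s in subsets)


if __name__ == "__main__":
    # With n = 5 and k = 2 there are 10 subsets, hence 9 degrees of freedom.
    print("reservoir:", chi_square_statistic(reservoir_sample, 5, 2, 20000))
    print("biased:   ", chi_square_statistic(biased_sample, 5, 2, 20000))
```

For a uniform scheme the statistic should stay close to the number of degrees of freedom (9 here), while the broken variant yields a value that is orders of magnitude larger because some subsets can never occur. Enumerating all subsets is only feasible for very small n and k, so a check of this kind is best read as a quick sanity test rather than a proof of uniformity.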