Don't let the negatives bring you down: sampling from streams of signed updates

Authors:
Edith Cohen;Graham Cormode;Nick Duffield
Affiliations:
AT&T Labs-Research, Florham Park, NJ, USA;AT&T Labs-Research, Florham Park, NJ, USA;AT&T Labs-Research, Florham Park, NJ, USA
Venue:
Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Year:
2012

Abstract

Random sampling has repeatedly proven to be a powerful tool for working with large data. Queries over the full dataset are replaced by approximate queries over the smaller (and hence easier to store and manipulate) sample. The sample constitutes a flexible summary that supports a wide class of queries. But in many applications, datasets are modified over time, and it is desirable to update samples without requiring access to the full underlying datasets. In this paper, we introduce and analyze novel techniques for sampling over dynamic data, modeled as a stream of modifications to weights associated with each key. While sampling schemes designed for stream applications can often readily accommodate positive updates to the dataset, much less is known about the case of negative updates, where weights are reduced or items deleted altogether. We primarily consider the turnstile model of streams, and extend classic schemes to incorporate negative updates. Perhaps surprisingly, the modifications needed to handle negative updates turn out to be natural and seamless extensions of the well-known positive update-only algorithms. We show that they produce unbiased estimators, and we relate their performance to the behavior of the corresponding algorithms on insert-only streams with different parameters. A careful analysis is needed to account for the fact that sampling choices for one key now depend on the choices made for other keys. In practice, our solutions turn out to be efficient and accurate. Compared to recent algorithms for Lp sampling that can be applied to this problem, they are significantly more reliable and dramatically faster.
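
To make the idea of extending a classic insert-only scheme to signed updates concrete, the following is a minimal sketch of a counting sample (sample-and-hold) that accepts unit increments and decrements. It is an illustration under stated assumptions, not the paper's algorithm as published: the names `CountingSample` and `mean_estimate` are hypothetical, updates are restricted to +1 and -1, and the stream is assumed to be strict turnstile (a key's true weight never drops below zero). Under those assumptions the per-key estimate `c - 1 + 1/p` for a held counter `c > 0` is unbiased.

```python
import random


class CountingSample:
    """Counting sample (sample-and-hold) over unit signed updates.

    Illustrative sketch only. Assumes a strict turnstile stream:
    the true weight of every key stays non-negative at all times.
    """

    def __init__(self, p, rng=None):
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        self.p = p
        self.rng = rng if rng is not None else random.Random()
        self.counts = {}  # key -> held counter; only sampled keys appear

    def update(self, key, delta):
        """Apply one unit update; delta must be +1 or -1."""
        if delta == 1:
            if key in self.counts:
                # Key is already held: count every further increment.
                self.counts[key] += 1
            elif self.rng.random() < self.p:
                # Key is not held: start holding it with probability p.
                self.counts[key] = 1
        elif delta == -1:
            if key in self.counts:
                # Decrement the held counter; drop the key when it hits zero.
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]
            # A decrement for a key that is not held is ignored.
        else:
            raise ValueError("delta must be +1 or -1")

    def estimate(self, key):
        """Estimate of the key's current weight (0.0 if the key is not held)."""
        c = self.counts.get(key, 0)
        return c - 1 + 1.0 / self.p if c > 0 else 0.0


def mean_estimate(stream, key, p, trials, seed=0):
    """Average the estimate for `key` over independent runs on `stream`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = CountingSample(p, rng)
        for k, delta in stream:
            sample.update(k, delta)
        total += sample.estimate(key)
    return total / trials
```

Averaging over many runs, for example `mean_estimate([("x", 1)] * 8 + [("x", -1)] * 3, "x", 0.3, 20000)`, should land close to the true weight of 5, which is one way to check the unbiasedness claim empirically. Setting `p = 1.0` makes the sketch deterministic and the estimate exact.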