Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Space-efficient online computation of quantile summaries
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
The discrepancy method: randomness and complexity
The discrepancy method: randomness and complexity
Sampling from a moving window over streaming data
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Finding (Recently) Frequent Items in Distributed Data Streams
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Holistic aggregates in a networked world: distributed tracking of approximate quantiles
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Sketching streams through the net: distributed approximate query tracking
VLDB '05 Proceedings of the 31st international conference on Very large data bases
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Communication-efficient distributed monitoring of thresholded counts
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
A geometric approach to monitoring threshold functions over distributed data streams
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
An integrated efficient solution for computing frequent and top-k elements in data streams
ACM Transactions on Database Systems (TODS)
Algorithms for distributed functional monitoring
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Sampling time-based sliding windows in bounded space
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Shape sensitive geometric monitoring
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Optimal sampling from sliding windows
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Optimal tracking of distributed heavy hitters and quantiles
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Tight bounds for Lp samplers, finding duplicates in streams, and related problems
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Tracking distributed aggregates over time-based sliding windows
Proceedings of the 30th annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Continuous distributed monitoring: a short survey
Proceedings of the First International Workshop on Algorithms and Models for Distributed Event Processing
Optimal random sampling from distributed streams revisited
DISC'11 Proceedings of the 25th international conference on Distributed computing
Analyzing graph structure via linear measurements
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
Lower bounds for number-in-hand multiparty communication complexity, made easy
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
Graph sketches: sparsification, spanners, and subgraphs
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Continuous distributed counting for non-monotonic streams
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Tight bounds for distributed functional monitoring
STOC '12 Proceedings of the forty-fourth annual ACM symposium on Theory of computing
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Tracking distributed aggregates over time-based sliding windows
SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
Efficient protocols for distributed classification and optimization
ALT'12 Proceedings of the 23rd international conference on Algorithmic Learning Theory
Utility-maximizing event stream suppression
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Quality and efficiency for kernel density estimates in large data
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Adaptive stratified reservoir sampling over heterogeneous data streams
Information Systems
Hi-index | 0.00 |
A fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol and track parameters of the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent items, or arrivals within the last w time units. We show that our protocols are optimal, not just in terms of the communication used, but also that they use minimal or near minimal (up to logarithmic factors) time to process each new item, and space to operate.