Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
New sampling-based summary statistics for improving approximate query answers
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
The art of computer programming, volume 3: (2nd ed.) sorting and searching
The art of computer programming, volume 3: (2nd ed.) sorting and searching
Space-efficient online computation of quantile summaries
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
The discrepancy method: randomness and complexity
The discrepancy method: randomness and complexity
Sampling from a moving window over streaming data
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Data streams: algorithms and applications
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Estimating flow distributions from sampled flow statistics
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Finding (Recently) Frequent Items in Distributed Data Streams
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Sampling in dynamic data streams and applications
SCG '05 Proceedings of the twenty-first annual symposium on Computational geometry
Holistic aggregates in a networked world: distributed tracking of approximate quantiles
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Sketching streams through the net: distributed approximate query tracking
VLDB '05 Proceedings of the 31st international conference on Very large data bases
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Communication-efficient distributed monitoring of thresholded counts
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
A geometric approach to monitoring threshold functions over distributed data streams
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
An integrated efficient solution for computing frequent and top-k elements in data streams
ACM Transactions on Database Systems (TODS)
Maintaining bernoulli samples over evolving multisets
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Maintaining bounded-size sample synopses of evolving datasets
The VLDB Journal — The International Journal on Very Large Data Bases
Algorithms for distributed functional monitoring
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Sampling time-based sliding windows in bounded space
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Shape sensitive geometric monitoring
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Optimal sampling from sliding windows
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Optimal tracking of distributed heavy hitters and quantiles
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Functional Monitoring without Monotonicity
ICALP '09 Proceedings of the 36th International Colloquium on Automata, Languages and Programming: Part I
MFCS '09 Proceedings of the 34th International Symposium on Mathematical Foundations of Computer Science 2009
Weighted random sampling with a reservoir
Information Processing Letters
Randomized algorithms for tracking distributed count, frequencies, and ranks
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Approximate answers to OLAP queries on streaming data warehouses
Proceedings of the fifteenth international workshop on Data warehousing and OLAP
The continuous distributed monitoring model
ACM SIGMOD Record
Realtime analysis of information diffusion in social media
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
A fundamental problem in data management is to draw and maintain a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The main challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this article, we present communication-efficient protocols for continuously maintaining a sample (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent elements, or arrivals within the last w time units. We show that our protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.