Mergeable summaries

Authors:
Pankaj K. Agarwal;Graham Cormode;Zengfeng Huang;Jeff M. Phillips;Zhewei Wei;Ke Yi
Affiliations:
Duke University, Durham, NC;University of Warwick, Coventry, UK;Aarhus University, Aarhus, Denmark;University of Utah, Salt Lake City, UT;Aarhus University, Aarhus, Denmark;Tsinghua University and Hong Kong University of Science and Technology, Beijing, China
Venue:
ACM Transactions on Database Systems (TODS) - Invited papers issue
Year:
2013


Abstract

We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ϵ-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ϵ); for ϵ-approximate quantiles, there is a deterministic summary of size O((1/ϵ) log(ϵn)) that has a restricted form of mergeability, and a randomized one of size O((1/ϵ) log^{3/2}(1/ϵ)) with full mergeability. We also extend our results to geometric summaries such as ϵ-approximations which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ϵ-approximate quantiles that depends only on ϵ, of size O((1/ϵ) log^{3/2}(1/ϵ)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
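
To make the notion of mergeability concrete, the following is a minimal Python sketch of merging two MG (Misra-Gries) heavy-hitter summaries. It is an illustration under stated assumptions, not the article's own code: the function names mg_update, mg_build, and mg_merge and the counter budget k are hypothetical, and the merge rule used here (add the counters, then subtract the (k+1)-th largest count and drop non-positive counters) is one common way to keep the summary at k counters. With k counters, each estimate undercounts the true frequency by at most n/(k+1), so choosing k on the order of 1/ϵ corresponds to the O(1/ϵ) size mentioned above.

```python
from collections import Counter


def mg_update(summary, item, k):
    """Process one item with a Misra-Gries summary of at most k counters.

    All names here are illustrative assumptions, not taken from the article.
    """
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # No free counter: decrement every counter and drop those hitting zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def mg_build(stream, k):
    """Build a summary (dict item -> count) over an iterable of items."""
    summary = {}
    for item in stream:
        mg_update(summary, item, k)
    return summary


def mg_merge(a, b, k):
    """Merge two summaries into one with at most k counters.

    Counts for the same item are added. If more than k counters remain,
    the (k+1)-th largest count is subtracted from every counter and
    counters that are no longer positive are discarded.
    """
    combined = Counter(a)
    combined.update(b)  # Counter.update adds counts rather than replacing them
    if len(combined) > k:
        threshold = sorted(combined.values(), reverse=True)[k]
        combined = {
            key: count - threshold
            for key, count in combined.items()
            if count > threshold
        }
    return dict(combined)


if __name__ == "__main__":
    left = ["a"] * 40 + ["b"] * 25 + [f"x{i}" for i in range(35)]
    right = ["a"] * 30 + ["c"] * 45 + [f"y{i}" for i in range(25)]
    merged = mg_merge(mg_build(left, 4), mg_build(right, 4), 4)
    print(merged)
```

The merged summary can itself be merged again with further summaries, which is what allows aggregation over an arbitrary tree of nodes rather than a single stream.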