Space-optimal heavy hitters with strong error bounds

Authors:
Radu Berinde;Piotr Indyk;Graham Cormode;Martin J. Strauss
Affiliations:
MIT;MIT;AT&T Labs--Research;University of Michigan
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2010

Abstract

The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of counter-based algorithms (including the popular and very space-efficient Frequent and SpaceSaving algorithms) provides much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining tail. This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound. This tail guarantee allows these algorithms to solve the sparse recovery problem. Here, the goal is to recover a faithful representation of the vector of frequencies, f. We prove that using space O(k), the algorithms construct an approximation f* to the frequency vector f so that the L1 error ‖f − f*‖₁ is close to the best possible error min_{f′} ‖f′ − f‖₁, where f′ ranges over all vectors with at most k non-zero entries. This improves the previously best known space bound of about O(k log n) for streams without element deletions (where n is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.
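
To make the tail guarantee concrete, the following is a minimal Python sketch of a counter-based summary in the style of the Frequent algorithm, together with a check that per-item errors stay within a bound that depends only on the residual tail. It is an illustration under stated assumptions, not the authors' implementation: the function names `frequent` and `tail_bound`, the choice of m = 5 counters and k = 2, the toy stream, and the exact form of the bound (tail mass divided by m − k) are all chosen here for the example.

```python
from collections import Counter


def frequent(stream, m):
    """Counter-based summary in the style of Frequent, using at most m counters.

    Illustrative sketch only; each stored count underestimates the true count.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            # The arriving item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def tail_bound(true_counts, m, k):
    """Assumed form of the tail bound: (mass outside the top k) / (m - k)."""
    residual = sum(sorted(true_counts.values(), reverse=True)[k:])
    return residual / (m - k)


# Toy stream (an assumption for illustration): two heavy items and a light tail.
stream = ["a"] * 50 + ["b"] * 30 + [f"t{i}" for i in range(20)]
truth = Counter(stream)
summary = frequent(stream, m=5)

# Largest estimation error over all items; items without a counter estimate to 0.
max_error = max(truth[x] - summary.get(x, 0) for x in truth)
bound = tail_bound(truth, m=5, k=2)
print(max_error, bound)
```

In this toy run the stream has total length 100, yet the error bound is computed from the 20 tail occurrences alone, which is the sense in which the error does not depend on the most frequent elements.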