Succinct approximate counting of skewed data

Authors:
David Talbot
Affiliations:
Google Inc., Mountain View, CA
Venue:
IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
Year:
2009

Citing 7
Cited 5

Randomized algorithms

Randomized algorithms
The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Counting large numbers of events in small registers

Communications of the ACM
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
Exact and approximate membership testers

STOC '78 Proceedings of the tenth annual ACM symposium on Theory of computing
Spectral Bloom filters

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
An improved data stream summary: the count-min sketch and its applications

Journal of Algorithms

Online generation of locality sensitive hash signatures

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Storing the web in memory: space efficient language models with constant time retrieval

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Flexible approximate counting

Proceedings of the 15th Symposium on International Database Engineering & Applications
Streaming analysis of discourse participants

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Sketch algorithms for estimating point queries in NLP

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Abstract

Practical data analysis relies on the ability to count observations of objects succinctly and efficiently. Unfortunately, the space usage of an exact estimator grows with the size of the a priori set from which objects are drawn, while the time required to maintain such an estimator grows with the size of the data set. We present static and on-line approximation schemes that avoid these limitations when approximate frequency estimates are acceptable. Our Log-Frequency Sketch extends the approximate counting algorithm of Morris [1978] to estimate frequencies with bounded relative error via a single pass over a data set. It uses constant space per object when the frequencies follow a power law and can be maintained in constant time per observation. We give an (ε, δ)-approximation scheme, which we verify empirically on a large natural language data set where, for instance, 95 percent of frequencies are estimated with relative error less than 0.25 using fewer than 11 bits per object in the static case and 15 bits per object on-line.
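
The abstract builds on the approximate counting algorithm of Morris [1978]. As a rough illustration of that underlying idea only, and not of the Log-Frequency Sketch itself, the following Python sketch stores a small exponent instead of the full count and increments it probabilistically. The class name MorrisCounter, the helper mean_estimate, and the choice of base are assumptions made for this example; they do not come from the paper.

```python
import random


class MorrisCounter:
    """Illustrative Morris-style approximate counter (names are hypothetical)."""

    def __init__(self, base=2.0, rng=None):
        if base <= 1.0:
            raise ValueError("base must be greater than 1")
        self.base = base
        self.exponent = 0  # the only state kept per counted object
        self._rng = rng or random.Random()

    def increment(self):
        # Advance the exponent with probability base ** -exponent, so that
        # the expected growth of estimate() is exactly 1 per observation.
        if self._rng.random() < self.base ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        # Unbiased estimate of the number of increment() calls so far.
        return (self.base ** self.exponent - 1.0) / (self.base - 1.0)


def mean_estimate(true_count, trials, base, seed=0):
    """Average the estimate over independent counters fed true_count events."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counter = MorrisCounter(base=base, rng=rng)
        for _ in range(true_count):
            counter.increment()
        total += counter.estimate()
    return total / trials


if __name__ == "__main__":
    # A base closer to 1 lowers the variance but makes the exponent grow faster.
    print(mean_estimate(true_count=100, trials=2000, base=1.1))
```

Because only the exponent is stored, the register grows roughly with the logarithm of the logarithm of the count. A base closer to 1 trades more bits for lower relative error, which is the kind of space/accuracy trade-off the abstract quantifies for its own scheme.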