Note: Order statistics and estimating cardinalities of massive data sets

Authors:
Frédéric Giroire
Affiliations:
ALGO project, INRIA Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France and MASCOTTE, joint project CNRS-INRIA-UNSA, 2004 Routes des Lucioles, BP 93, F-06902, France
Venue:
Discrete Applied Mathematics
Year:
2009


Abstract

This note introduces a new class of algorithms that estimate the cardinality of very large multisets in a single pass over the data using constant memory. The algorithms are based on order statistics rather than on bit patterns in the binary representations of numbers. Three families of estimators are analyzed. They attain a standard error of $1/\sqrt{M}$ using $M$ units of storage, which places them in the same class as the best algorithms known so far. The algorithms have a very simple internal loop, which gives them an advantage in processing speed. For instance, a memory of only 12 kB and only a few seconds are sufficient to process a multiset with several million elements and to build an estimate with an accuracy of about 2 percent. The algorithms are validated both by mathematical analysis and by experiments on real Internet traffic.
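
To make the order-statistics idea concrete, the sketch below hashes each element to a pseudo-uniform value in [0, 1) and keeps only the k smallest distinct hash values. If the k-th smallest value is h, roughly k - 1 distinct elements fell below h, which suggests an estimate of (k - 1) / h. This is the well-known k-minimum-values estimator, shown here as an illustration in the same spirit; it is not one of the three estimator families analyzed in the note. The names `hash01` and `estimate_cardinality`, the choice of SHA-256, and the parameter `k` are assumptions made for the example.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) (assumed hash: SHA-256)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(stream, k=256):
    """Estimate the number of distinct items from the k-th smallest hash value.

    Illustrative k-minimum-values sketch, not the note's own estimators.
    Memory is O(k) regardless of the stream length; one pass over the data.
    """
    heap = []        # max-heap (values negated) holding the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for item in stream:
        h = hash01(item)
        if h in members:
            continue  # repeated element already tracked
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash beats the current k-th minimum: swap out the largest.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: count is exact
    kth_minimum = -heap[0]
    return (k - 1) / kth_minimum


if __name__ == "__main__":
    # 100,000 items, 20,000 of them distinct.
    stream = (i % 20000 for i in range(100000))
    print(round(estimate_cardinality(stream, k=1024)))
```

With this estimator the relative standard error is expected to shrink roughly like one over the square root of k, which mirrors the memory and accuracy trade-off stated in the abstract. Because repeated elements hash to the same value, duplicates never change the sketch.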