Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed for this problem, among them HyperLogLog. In this paper, we present a series of improvements to HyperLogLog that reduce its memory requirements and significantly increase its accuracy for an important range of cardinalities. We have implemented the improved algorithm for a system at Google and evaluated it empirically against the original HyperLogLog algorithm. Like HyperLogLog, the improved algorithm parallelizes perfectly and computes the cardinality estimate in a single pass.
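To make the single-pass and parallelization properties concrete, here is a minimal Python sketch of the baseline HyperLogLog estimator as it is commonly described. It is not the improved algorithm this paper presents. The class name `HyperLogLog`, the default precision `p=12`, and the use of the first 64 bits of SHA-1 as the hash function are assumptions made for illustration only.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative baseline HyperLogLog sketch (hypothetical helper class)."""

    def __init__(self, p=12):
        # The alpha constant used in estimate() is the usual one for m >= 128.
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 in this sketch")
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Assumption: first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        width = 64 - self.p
        idx = h >> width                # top p bits select a register
        rest = h & ((1 << width) - 1)   # remaining bits
        # 1-based position of the leftmost 1-bit in the remaining bits
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches built on separate partitions combine by register-wise max.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

Each element is hashed and touches exactly one register, so one pass over the data suffices. Because `merge` takes the register-wise maximum, sketches built independently on separate partitions of the data combine into exactly the registers a single sketch over the whole data would hold, which is what makes the approach parallelize.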