A new class of algorithms to estimate the cardinality of very large multisets using constant memory and a single pass over the data is introduced here. It is based on order statistics rather than on bit patterns in binary representations of numbers. Three families of estimators are analyzed. They attain a standard error of order 1/√M using M units of storage, which places them in the same class as the best known algorithms so far. The algorithms have a very simple inner loop, which gives them an advantage in terms of processing speed. For instance, a memory of only 12 kB and only a few seconds are sufficient to process a multiset with several million elements and to build an estimate with accuracy of order 2 percent. The algorithms are validated both by mathematical analysis and by experiments on real Internet traffic.
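To give a flavor of order-statistics-based cardinality estimation, the sketch below implements a k-minimum-values (KMV) estimator, a well-known member of this family (though not necessarily the exact estimator analyzed in the paper). Each element is hashed to a pseudo-uniform value in (0, 1); only the k smallest distinct hash values are retained, and the kth smallest order statistic yields the estimate (k − 1)/v(k). The function and parameter names are illustrative.

```python
import hashlib
import heapq

def kmv_estimate(stream, k=256):
    """Estimate the number of distinct elements in `stream` from the
    k smallest hash values (an order-statistics / KMV sketch)."""

    def h(x):
        # Hash each element to a pseudo-uniform value in (0, 1).
        d = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        return int.from_bytes(d, "big") / 2.0**64

    heap = []        # max-heap (values negated) holding the k smallest hashes
    members = set()  # hash values currently in the heap, to skip duplicates
    for item in stream:
        v = h(item)
        if v in members:
            continue                       # duplicate element: ignore
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:                 # smaller than current kth smallest
            members.discard(-heapq.heappushpop(heap, -v))
            members.add(v)

    if len(heap) < k:
        return float(len(heap))            # fewer than k distinct: exact count
    # kth smallest of n uniforms is ~ k/n, so (k - 1)/v_(k) estimates n.
    return (k - 1) / (-heap[0])
```

With M = k registers the relative standard error behaves like 1/√k, matching the 1/√M accuracy class mentioned in the abstract; the inner loop is a single hash plus a comparison, which is what makes such sketches fast in practice.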