Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences
The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
A small approximately min-wise independent family of hash functions
Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
Min-wise independent permutations
Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Counting Distinct Elements in a Data Stream
RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
Approximate Aggregation Techniques for Sensor Databases
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Tracking set-expression cardinalities over continuous update streams
The VLDB Journal — The International Journal on Very Large Data Bases
Data streaming algorithms for accurate and efficient measurement of traffic and flow matrices
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Fast and accurate traffic matrix measurement using adaptive cardinality counting
Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data
Bitmap algorithms for counting active flows on high-speed links
IEEE/ACM Transactions on Networking (TON)
Counting distinct items over update streams
ISAAC'05 Proceedings of the 16th international conference on Algorithms and Computation
Estimating cardinality distributions in network traffic: extended abstract
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
An online framework for catching top spreaders and scanners
Computer Networks: The International Journal of Computer and Telecommunications Networking
Hi-index | 0.00 |
Estimating the cardinality (i.e. number of distinct elements) of an arbitrary set expression defined over multiple distributed streams is one of the most fundamental queries of interest. Earlier methods based on probabilistic sketches have focused mostly on the sketching algorithms. However, the estimators do not fully utilize the information in the sketches and thus are not statistically efficient. In this paper, we develop a novel statistical model and an efficient yet simple estimator for the cardinalities based on a continuous variant of the well known Flajolet-Martin sketches. Specifically, we show that, for two streams, our estimator has almost the same statistical efficiency as the Maximum Likelihood Estimator (MLE), which is known to be optimal in the sense of Cramer-Rao lower bounds under regular conditions. Moreover, as the number of streams gets larger, our estimator is still computationally simple, but the MLE becomes intractable due to the complexity of the likelihood. Let N be the cardinality of the union of all streams, and |S| be the cardinality of a set expression S to be estimated. For a given relative standard error δ, the memory requirement of our estimator is O(δ-2|S|-1 N log log N), which is superior to state-of-the-art algorithms, especially for large N and small |S|/N where the estimation is most challenging.