The average-case complexity of counting distinct elements

  • Author: David P. Woodruff
  • Affiliation: IBM Almaden, San Jose, CA
  • Venue: Proceedings of the 12th International Conference on Database Theory
  • Year: 2009

Abstract

We continue the study of approximating the number of distinct elements in a data stream of length n to within a (1 ± ε) factor. It is known that if the stream may consist of arbitrary data arriving in an arbitrary order, then any 1-pass algorithm requires Ω(1/ε²) bits of space to perform this task. To try to bypass this lower bound, the problem was recently studied in a model in which the stream may consist of arbitrary data, but it arrives at the algorithm in a random order. However, even in this model an Ω(1/ε²) lower bound was established, because the adversary can still choose the data arbitrarily. This leaves open the possibility that the problem is only hard under a pathological choice of data, which would be of little practical relevance. We study the average-case complexity of this problem under certain distributions. Namely, we study the case when each successive stream item is drawn independently and uniformly at random from an unknown subset of d items, for an unknown value of d. This captures the notion of random, uncorrelated data. For a wide range of values of d and n, we design a 1-pass algorithm that bypasses the Ω(1/ε²) lower bound that holds in the adversarial and random-order models, thereby showing that this model admits more space-efficient algorithms. Moreover, the update time of our algorithm is optimal. Despite these positive results, for a certain range of values of d and n we show that estimating the number of distinct elements requires Ω(1/ε²) bits of space even in this model. Our lower bound subsumes previous bounds, showing that even for natural choices of data the problem is hard.
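
To make the data model concrete, the sketch below (a hypothetical Python illustration, not the paper's algorithm) generates a stream in which each item is drawn independently and uniformly at random from an unknown set of d items, and estimates the number of distinct elements with a standard k-minimum-values (KMV) sketch. A KMV sketch needs roughly k ≈ 1/ε² stored hashes to reach a (1 ± ε) guarantee, which is exactly the space bound the paper's algorithm is designed to beat under this random-data model; the names stream_from_model and kmv_estimate and the parameter values are illustrative assumptions.

# Hypothetical illustration of the data model and a baseline estimator.
# Each stream item is drawn i.i.d. uniformly from an unknown set of d items;
# the estimator is a standard k-minimum-values (KMV) sketch, NOT the
# algorithm from the paper.
import hashlib
import random


def stream_from_model(d, n, seed=0):
    """Yield a stream of length n, each item i.i.d. uniform over d items."""
    rng = random.Random(seed)
    universe = [f"item-{i}" for i in range(d)]  # the unknown subset of d items
    for _ in range(n):
        yield rng.choice(universe)


def kmv_estimate(stream, k=1024):
    """Estimate the number of distinct elements via a k-minimum-values sketch.

    Hash each item to [0, 1) and keep the k smallest distinct hash values.
    If the k-th smallest is h_k, estimate the distinct count as (k - 1) / h_k.
    """
    smallest = set()  # the k smallest distinct normalized hashes seen so far
    for item in stream:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) / 2.0 ** 160
        if len(smallest) < k:
            smallest.add(h)
        elif h < max(smallest) and h not in smallest:
            smallest.remove(max(smallest))
            smallest.add(h)
    if len(smallest) < k:  # fewer than k distinct items seen: the count is exact
        return len(smallest)
    return (k - 1) / max(smallest)


if __name__ == "__main__":
    d, n = 50_000, 200_000  # arbitrary example values
    data = list(stream_from_model(d, n))
    print(f"true distinct = {len(set(data))}, KMV estimate = {kmv_estimate(data):.0f}")

With k = 1024 the relative error of this baseline is about 1/sqrt(k) ≈ 3%, illustrating why a worst-case (1 ± ε) guarantee forces Θ(1/ε²) stored values for sketches of this kind.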