Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs

  • Authors:
  • Gregory Valiant; Paul Valiant

  • Affiliations:
  • UC Berkeley, Berkeley, CA, USA (both authors)

  • Venue:
  • Proceedings of the forty-third annual ACM symposium on Theory of computing (STOC '11)
  • Year:
  • 2011

Abstract

We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size. Additionally, we show new matching lower bounds. Together, this settles the longstanding question of the sample complexities of these estimation problems, up to constant factors. Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples, where n is a bound on the support size, or, in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain. Previously, no explicit sublinear-sample algorithms for either of these problems were known. Our algorithm is also computationally extremely efficient, running in time linear in the number of samples used.

In the second half of the paper, we provide a matching lower bound of Ω(n/log n) samples for estimating entropy or distribution support size to within an additive constant. The previous lower bounds on these sample complexities were n/2^(O(√log n)). To show our lower bound, we prove two new and natural multivariate central limit theorems (CLTs); the first uses Stein's method to relate the sum of independent random variables to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance metric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for "generalized multinomial" distributions: a large class of discrete distributions, parameterized by matrices, that represents sums of independent binomial or multinomial distributions and describes many distributions encountered in computer science. Convergence here is in the strong sense of statistical distance, which immediately implies that any algorithm whose input is drawn from a generalized multinomial distribution behaves essentially as if the input were drawn from a discretized Gaussian with the same mean and covariance. Such tools in the multivariate setting are rare, and we hope this new tool will be of use to the community.
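
As background (this sketch is not from the paper), the baseline these results improve upon is the naive "plug-in" approach, which estimates entropy and support size from the empirical distribution alone. Because it assigns zero probability to unseen symbols, with m ≪ n samples it systematically underestimates both quantities; the paper's O(n/log n)-sample estimator is designed to account for that unseen mass. The function names below are illustrative, not the paper's.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in (empirical) entropy estimate, in nats.

    Not the paper's estimator: symbols that were never observed
    contribute nothing, so the estimate is biased downward whenever
    the sample size is small relative to the support size.
    """
    counts = Counter(samples)
    m = len(samples)
    return -sum((c / m) * math.log(c / m) for c in counts.values())

def observed_support(samples):
    """Number of distinct observed symbols: a lower bound on the true
    support size that badly undercounts when m << n."""
    return len(set(samples))
```

A quick illustration of the bias: for the uniform distribution over n = 1000 symbols, the true entropy is log(1000) ≈ 6.91 nats and the true support size is 1000, but with only m = 100 samples both plug-in estimates fall far short.

```python
import random

samples = [random.randrange(1000) for _ in range(100)]
print(plugin_entropy(samples))    # roughly log(100) ~ 4.6 nats, well below 6.91
print(observed_support(samples))  # at most 100, well below 1000
```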