Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs

  • Authors:
  • Gregory Valiant; Paul Valiant

  • Affiliations:
  • UC Berkeley, Berkeley, CA, USA (both authors)

  • Venue:
  • Proceedings of the forty-third annual ACM symposium on Theory of computing (STOC '11)
  • Year:
  • 2011

Abstract

We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size. Additionally, we show new matching lower bounds. Together, this settles the longstanding question of the sample complexities of these estimation problems, up to constant factors. Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples, where n is a bound on the support size, or, in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain. Previously, no explicit sublinear-sample algorithms for either of these problems were known. Our algorithm is also computationally extremely efficient, running in time linear in the number of samples used.

In the second half of the paper, we provide a matching lower bound of Ω(n/log n) samples for estimating entropy or distribution support size to within an additive constant. The previous lower bounds on these sample complexities were n/2^(O(√log n)). To show our lower bound, we prove two new and natural multivariate central limit theorems (CLTs); the first uses Stein's method to relate the sum of independent random variables to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance metric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for "generalized multinomial" distributions: a large class of discrete distributions, parameterized by matrices, that represents sums of independent binomial or multinomial distributions and describes many distributions encountered in computer science. Convergence here is in the strong sense of statistical distance, which immediately implies that any algorithm whose input is drawn from a generalized multinomial distribution behaves essentially as if the input were drawn from a discretized Gaussian with the same mean and covariance. Such tools in the multivariate setting are rare, and we hope this new tool will be of use to the community.
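
As background (this sketch is not from the paper), the baseline these results improve upon is the naive "plug-in" approach, which estimates entropy and support size from the empirical distribution alone. Because it assigns zero probability to unseen symbols, with m ≪ n samples it systematically underestimates both quantities; the paper's O(n/log n)-sample estimator is designed to account for that unseen mass. The function names below are illustrative, not the paper's.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in (empirical) entropy estimate, in nats.

    Not the paper's estimator: symbols that were never observed
    contribute nothing, so the estimate is biased downward whenever
    the sample size is small relative to the support size.
    """
    counts = Counter(samples)
    m = len(samples)
    return -sum((c / m) * math.log(c / m) for c in counts.values())

def observed_support(samples):
    """Number of distinct observed symbols: a lower bound on the true
    support size that badly undercounts when m << n."""
    return len(set(samples))
```

A quick illustration of the bias: for the uniform distribution over n = 1000 symbols, the true entropy is log(1000) ≈ 6.91 nats and the true support size is 1000, but with only m = 100 samples both plug-in estimates fall far short.

```python
import random

samples = [random.randrange(1000) for _ in range(100)]
print(plugin_entropy(samples))    # roughly log(100) ~ 4.6 nats, well below 6.91
print(observed_support(samples))  # at most 100, well below 1000
```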