We introduce a new approach to characterizing the unobserved portion of a distribution, which yields sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size. We also show new matching lower bounds. Together, these results settle the longstanding question of the sample complexities of these estimation problems, up to constant factors. Our algorithm estimates these properties up to an arbitrarily small additive constant using O(n/log n) samples, where n is a bound on the support size or, in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain. Previously, no explicit sublinear-sample algorithms for either of these problems were known. Our algorithm is also computationally extremely efficient, running in time linear in the number of samples used.

In the second half of the paper, we provide a matching lower bound of Ω(n/log n) samples for estimating entropy or distribution support size to within an additive constant. The previous lower bounds on these sample complexities were n/2^{O(√log n)}. To show our lower bound, we prove two new and natural multivariate central limit theorems (CLTs). The first uses Stein's method to relate the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance metric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for "generalized multinomial" distributions: a large class of discrete distributions, parameterized by matrices, that represents sums of independent binomial or multinomial distributions and describes many distributions encountered in computer science. Convergence here is in the strong sense of statistical distance, which immediately implies that any algorithm with input drawn from a generalized multinomial distribution behaves essentially as if the input were drawn from a discretized Gaussian with the same mean and covariance. Such tools in the multivariate setting are rare, and we hope this new tool will be of use to the community.
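To make the estimation problem concrete, the following minimal sketch computes the naive "plug-in" baselines for the two properties named above: the entropy of the empirical frequencies and the number of distinct elements observed. This is not the estimator introduced in the paper; it only illustrates why the unobserved portion of the distribution matters when the sample size is on the order of n/log n. The function names, the uniform test distribution, and the values of n and the seed are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Empirical (plug-in) entropy in nats: the entropy of the observed frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def observed_support_size(samples):
    """Number of distinct elements seen; never exceeds the true support size."""
    return len(set(samples))


def draw_uniform(n, k, seed=0):
    """Draw k i.i.d. samples from the uniform distribution on {0, ..., n-1}."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(k)]


n = 10_000                    # true support size (hypothetical choice)
k = int(n / math.log(n))      # sample budget on the order of n / log n
samples = draw_uniform(n, k)

true_entropy = math.log(n)    # entropy of the uniform distribution, in nats
print(f"k = {k} samples")
print(f"plug-in entropy  = {plugin_entropy(samples):.3f} (true {true_entropy:.3f})")
print(f"observed support = {observed_support_size(samples)} (true {n})")
```

With k samples, at most k distinct elements can be observed, so the plug-in entropy is at most log k and the observed support size is at most k. Both therefore fall short of the true values log n and n for the uniform distribution whenever k < n, which is the gap a sublinear-sample estimator has to close by reasoning about elements it has not seen.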