Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem

Authors:
Sofya Raskhodnikova;Dana Ron;Amir Shpilka;Adam Smith
Affiliations:
sofya@cse.psu.edu and asmith@cse.psu.edu;danar@eng.tau.ac.il;shpilka@cs.technion.ac.il;-
Venue:
SIAM Journal on Computing
Year:
2009

Citing 0
Cited 15

Testing monotone continuous distributions on high-dimensional real cubes
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms

Invariance in property testing
Property testing

Testing monotone continuous distributions on high-dimensional real cubes
Property testing

Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs
Proceedings of the forty-third annual ACM symposium on Theory of computing

On approximating the number of relevant variables in a function
APPROX'11/RANDOM'11 Proceedings of the 14th international workshop and 15th international conference on Approximation, randomization, and combinatorial optimization: algorithms and techniques

Bounds from a card trick
Journal of Discrete Algorithms

Approximating and testing k-histogram distributions in sub-linear time
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems

Taming big probability distributions
XRDS: Crossroads, The ACM Magazine for Students - Big Data

Testing Symmetric Properties of Distributions
SIAM Journal on Computing

On the power of conditional samples in distribution testing
Proceedings of the 4th conference on Innovations in Theoretical Computer Science

Testing Closeness of Discrete Distributions
Journal of the ACM (JACM)

On Approximating the Number of Relevant Variables in a Function
ACM Transactions on Computation Theory (TOCT)

Estimating duplication by content-based sampling
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference

Abstract

We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least $\frac{1}{n}$. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length $n$. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha > 1$ requires $\Theta(\frac{n}{\alpha^2})$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in $n$ lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in $n$) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $\mathsf{X}_1$ and $\mathsf{X}_2$, with very different expectations and the following condition on the first $k$ moments: $\mathsf{E}[\mathsf{X}_1]/\mathsf{E}[\mathsf{X}_2] = \mathsf{E}[\mathsf{X}_1^2]/\mathsf{E}[\mathsf{X}_2^2] = \cdots = \mathsf{E}[\mathsf{X}_1^k]/\mathsf{E}[\mathsf{X}_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger problem, but differs in the requirement that our random variables have to be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.
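
The moment condition in the abstract can be made concrete with a toy check. The sketch below is only an illustration, not the construction from the paper: the helper names (`moment`, `moment_ratios`, `proportional_moments`) and the two small distributions `X1` and `X2` are assumptions chosen by hand so that the condition holds for $k = 2$ and fails for $k = 3$. Exact rational arithmetic avoids floating-point comparisons.

```python
from fractions import Fraction

def moment(dist, j):
    """j-th moment of a finite distribution given as {value: probability}."""
    return sum(p * Fraction(v) ** j for v, p in dist.items())

def moment_ratios(d1, d2, k):
    """Ratios E[X1^j] / E[X2^j] for j = 1..k."""
    return [moment(d1, j) / moment(d2, j) for j in range(1, k + 1)]

def proportional_moments(d1, d2, k):
    """True if the first k moment ratios are all equal."""
    ratios = moment_ratios(d1, d2, k)
    return all(r == ratios[0] for r in ratios)

# Hypothetical toy example (not from the paper), supported on positive integers:
# X1 is always 2; X2 is 1 with probability 8/9 and 4 with probability 1/9.
X1 = {2: Fraction(1)}
X2 = {1: Fraction(8, 9), 4: Fraction(1, 9)}

print(moment_ratios(X1, X2, 3))         # ratios 3/2, 3/2, 1
print(proportional_moments(X1, X2, 2))  # True: first two ratios agree
print(proportional_moments(X1, X2, 3))  # False: third ratio differs
```

Here the expectations differ by a factor of $3/2$ while the first two moment ratios coincide. The abstract's construction asks for the same kind of agreement across the first $k$ moments together with very different expectations.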