Privacy-preserving statistical estimation with optimal convergence rates
Proceedings of the forty-third annual ACM symposium on Theory of computing
We present multiple-pass streaming algorithms for a basic statistical clustering problem on massive data sets. In this problem, we are given a set of $n$ points drawn randomly according to a mixture of $k$ uniform distributions and wish to approximate the density function of the mixture. The points are placed in a data stream (possibly in adversarial order), which the algorithm may read only in sequential passes. If our algorithm is allotted $2\ell$ passes, it produces an approximation with error at most $\epsilon$ using $\tilde{O}(k^3/\epsilon^{2/\ell})$ bits of memory, the most critical resource for streaming computation. We demonstrate that this tradeoff between passes and memory is intrinsic to the problem and the model of computation by proving lower bounds on the memory requirements of any $\ell$-pass randomized algorithm that are nearly matched by our upper bounds. The algorithm is quite general and can be adapted to solve the problems of learning a mixture of linear distributions in $\mathbb{R}$ and a mixture of uniform distributions in $\mathbb{R}^2$.
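To make the setting and the pass/memory tradeoff concrete, the following is a minimal Python sketch, not the algorithm from the paper. It assumes each mixture component is uniform over an interval on the real line, represented by a hypothetical `(weight, lo, hi)` tuple. The function names `memory_bound`, `mixture_density`, and `sample_stream` are illustrative only, and `memory_bound` evaluates just the leading term $k^3/\epsilon^{2/\ell}$, ignoring the constants and polylogarithmic factors hidden by $\tilde{O}$.

```python
import random


def memory_bound(k, eps, ell):
    """Leading term k^3 / eps^(2/ell) of the stated memory bound.

    Assumption: constants and polylog factors hidden by O-tilde are dropped,
    so this is only useful for comparing settings, not as a bit count.
    """
    return k ** 3 / eps ** (2.0 / ell)


def mixture_density(x, components):
    """Density at x of a mixture of uniform distributions on intervals.

    components is a list of (weight, lo, hi) tuples whose weights sum to 1.
    """
    return sum(w / (hi - lo) for w, lo, hi in components if lo <= x < hi)


def sample_stream(n, components, seed=0):
    """Draw n points from the mixture, returned as a list to be read in passes."""
    rng = random.Random(seed)
    weights = [w for w, _, _ in components]
    stream = []
    for _ in range(n):
        # Pick a component by weight, then draw uniformly from its interval.
        _, lo, hi = rng.choices(components, weights=weights)[0]
        stream.append(rng.uniform(lo, hi))
    return stream


if __name__ == "__main__":
    # Hypothetical mixture of k = 2 overlapping uniform components.
    components = [(0.5, 0.0, 1.0), (0.5, 0.5, 2.5)]
    stream = sample_stream(1000, components)
    print(len(stream), mixture_density(0.75, components))
    for ell in (1, 2, 4):
        # The algorithm described above uses 2 * ell passes for parameter ell.
        print(2 * ell, "passes:", memory_bound(2, 0.01, ell))
```

With $k = 2$ and $\epsilon = 0.01$, the leading term falls from about $8 \times 10^4$ at $\ell = 1$ (two passes) to about $800$ at $\ell = 2$ (four passes), which illustrates how quickly additional passes reduce the dependence on $1/\epsilon$.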