Pass-Efficient Algorithms for Learning Mixtures of Uniform Distributions

  • Authors:
  • Kevin L. Chang; Ravi Kannan

  • Affiliations:
  • kevin.chang@yahoo-inc.com; kannan100@gmail.com

  • Venue:
  • SIAM Journal on Computing
  • Year:
  • 2009

Abstract

We present multiple-pass streaming algorithms for a basic statistical clustering problem for massive data sets. If our algorithm is allotted $2\ell$ passes, it will produce an approximation with error at most $\epsilon$ using $\tilde{O}(k^3/\epsilon^{2/\ell})$ bits of memory, the most critical resource for streaming computation. We demonstrate that this tradeoff between the number of passes and the memory allotted is intrinsic to the problem and model of computation by proving lower bounds on the memory requirements of any $\ell$-pass randomized algorithm that are nearly matched by our upper bounds. In this problem, we are given a set of $n$ points drawn randomly according to a mixture of $k$ uniform distributions and wish to approximate the density function of the mixture. The points are placed in a data stream (possibly in adversarial order), which may only be read in sequential passes by the algorithm. The algorithm is quite general and can be adapted to solve the problems of learning a mixture of linear distributions in $\mathbb{R}$ and a mixture of uniform distributions in $\mathbb{R}^2$.
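
To make the problem setup and the pass/memory tradeoff concrete, here is a minimal Python sketch. It is not the authors' algorithm: it only illustrates what a mixture of $k$ uniform distributions on intervals of $\mathbb{R}$ looks like, how a stream of points could be drawn from it, and how the stated bound $k^3/\epsilon^{2/\ell}$ scales with $\ell$. The component representation `(weight, left, right)`, the example `MIXTURE`, and all function names are assumptions made for illustration, and `memory_scale` drops the logarithmic factors hidden by $\tilde{O}$.

```python
import random

# Hypothetical representation (not from the paper): each component is
# (weight, left, right), a uniform distribution on [left, right].
MIXTURE = [(0.5, 0.0, 1.0), (0.3, 0.5, 2.0), (0.2, 3.0, 4.0)]


def mixture_density(x, mixture):
    """Density at x: sum of weight / width over the components covering x."""
    return sum(w / (b - a) for w, a, b in mixture if a <= x <= b)


def sample_stream(n, mixture, seed=0):
    """Draw n points from the mixture, standing in for the data stream."""
    rng = random.Random(seed)
    weights = [w for w, _, _ in mixture]
    points = []
    for _ in range(n):
        # Pick a component by weight, then a uniform point inside it.
        _, a, b = rng.choices(mixture, weights=weights, k=1)[0]
        points.append(rng.uniform(a, b))
    return points


def memory_scale(k, eps, ell):
    """k^3 / eps^(2/ell): the stated bound for 2*ell passes, log factors ignored."""
    return k ** 3 / eps ** (2.0 / ell)


if __name__ == "__main__":
    print(mixture_density(0.75, MIXTURE))
    print(len(sample_stream(1000, MIXTURE)))
    for ell in (1, 2, 4):
        print(2 * ell, "passes:", memory_scale(5, 0.01, ell))
```

Under these assumptions, with $k = 5$ and $\epsilon = 0.01$, moving from 2 passes ($\ell = 1$) to 4 passes ($\ell = 2$) reduces the scale of the bound from roughly $1.25 \times 10^6$ to roughly $1.25 \times 10^4$, which is the kind of tradeoff the abstract describes.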