The space complexity of approximating the frequency moments
Journal of Computer and System Sciences
Fast, small-space algorithms for approximate histogram maintenance
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Dynamic multidimensional histograms
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Learning Mixtures of Gaussians
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
A spectral algorithm for learning mixture models
Journal of Computer and System Sciences - Special issue on FOCS 2002
On Learning Mixtures of Heavy-Tailed Distributions
FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science
Streaming and sublinear approximation of entropy and information distances
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
The space complexity of pass-efficient algorithms for clustering
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Stable distributions, pseudorandom generators, embeddings, and data stream computation
Journal of the ACM (JACM)
The spectral method for general mixture models
COLT'05 Proceedings of the 18th annual conference on Learning Theory
We present a multiple-pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles (cells) in ${\mathbb R}^d$, for any $d > 0$. Our learning model is: samples drawn according to the mixture are placed in arbitrary order in a data stream that may only be accessed sequentially by an algorithm with a very limited random-access memory. Our algorithm makes $2\ell + 1$ passes, for any $\ell > 0$, and requires memory at most $\tilde O(\epsilon^{-2/\ell}k^2d^4+(2k)^d)$. This exhibits a strong memory-space tradeoff: a few more passes significantly lower the memory requirement, trading one of the two most important resources in streaming computation for the other. Chang and Kannan [1] first considered this problem for $d = 1, 2$. Our learning algorithm is especially appropriate for situations where massive data sets of samples are available, but practical computation with such large inputs requires very restricted models of computation.
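The pass/memory tradeoff described above can be illustrated with a toy sketch (this is an illustrative example, not the paper's mixture-learning algorithm): locating the median of a numeric stream with only $O(b)$ counters by making several sequential passes, each of which narrows the candidate range by a factor of $b$. More passes buy finer resolution for the same memory, mirroring the tradeoff in the abstract.

```python
# Toy multi-pass streaming sketch (illustrative only): approximate the
# median of a numeric stream in `passes` sequential scans using only
# O(buckets) memory. Each pass narrows the candidate range by a factor
# of `buckets`, so extra passes substitute for extra memory.

def multipass_median(stream_source, lo, hi, passes=3, buckets=10):
    """stream_source() must return a fresh sequential iterator each call."""
    n = sum(1 for _ in stream_source())          # one pass: count the items
    rank = n // 2                                # 0-indexed rank of the median
    for _ in range(passes):
        width = (hi - lo) / buckets
        counts = [0] * buckets                   # O(buckets) memory only
        smaller = 0                              # items strictly below `lo`
        for x in stream_source():                # one sequential pass
            if x < lo:
                smaller += 1
            elif x < hi:
                i = min(int((x - lo) / width), buckets - 1)
                counts[i] += 1
        # locate the bucket containing the item of the target rank
        acc = smaller
        for i, c in enumerate(counts):
            if acc + c > rank:
                lo, hi = lo + i * width, lo + (i + 1) * width
                break
            acc += c
    return (lo + hi) / 2                         # midpoint of the final range
```

With `buckets=10` and `passes=3`, the algorithm resolves the median to within a $10^{-3}$ fraction of the initial range while never holding more than ten counters, at the cost of four sequential scans of the stream.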