The space complexity of pass-efficient algorithms for clustering

Authors:
Kevin L. Chang;Ravi Kannan
Affiliations:
Yale University, New Haven;Yale University, New Haven
Venue:
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2006

Citing 18 (works cited by this paper, listed first)
Cited 7 (works citing this paper, listed second)

The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Incremental clustering and dynamic information retrieval

STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
Random sampling for histogram construction: how much is enough?

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
On data structures and asymmetric communication complexity

Journal of Computer and System Sciences
Computing on data streams

External memory algorithms
Learning mixtures of arbitrary gaussians

STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Data-streams and histograms

STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Fast, small-space algorithms for approximate histogram maintenance

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Pass efficient algorithms for approximating large matrices

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
An Information Statistics Approach to Data Stream and Communication Complexity

FOCS '02 Proceedings of the 43rd Symposium on Foundations of Computer Science
Optimal Histograms with Quality Guarantees

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Dynamic Maintenance of Wavelet-Based Histograms

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Better streaming algorithms for clustering problems

Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
Learning Mixtures of Gaussians

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Stable distributions, pseudorandom generators, embeddings and data stream computation

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Optimal approximations of the frequency moments of data streams

Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
Graph distances in the streaming model: the value of space

SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms

Estimating statistical aggregates on probabilistic data streams

Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Declaring independence via the sketching of sketches

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Estimating statistical aggregates on probabilistic data streams

ACM Transactions on Database Systems (TODS)
Multiple Pass Streaming Algorithms for Learning Mixtures of Distributions in ${\mathbb R}^d$

ALT '07 Proceedings of the 18th international conference on Algorithmic Learning Theory
Multiple pass streaming algorithms for learning mixtures of distributions in ${\mathbb R}^d$

Theoretical Computer Science
Optimal sampling from sliding windows

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Lower bounds for quantile estimation in random-order and multi-pass streaming

ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming

Abstract

We present multiple pass streaming algorithms for a basic clustering problem for massive data sets. If our algorithm is allotted 2ℓ passes, it will produce an approximation with error at most ε using Õ(k^3/ε^(2/ℓ)) bits of memory, the most critical resource for streaming computation. We demonstrate that this tradeoff between passes and memory allotted is intrinsic to the problem and model of computation by proving lower bounds on the memory requirements of any ℓ-pass randomized algorithm that are nearly matched by our upper bounds. To the best of our knowledge, this is the first time nearly matching bounds have been proved for such an exponential tradeoff for randomized computation.

In this problem, we are given a set of n points drawn randomly according to a mixture of k uniform distributions and wish to approximate the density function of the mixture. The points are placed in a datastream (possibly in adversarial order), which may only be read sequentially by the algorithm. We argue that this models, among others, the datastream produced by a national census of the incomes of all citizens.
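To make the pass/memory tradeoff in the abstract concrete, the short Python sketch below evaluates the leading term k^3/ε^(2/ℓ) of the stated memory bound for a few values of ℓ. It is only an illustration of the arithmetic, not the authors' algorithm: the function names `memory_bound` and `tradeoff_table` are hypothetical, and the constant and polylogarithmic factors hidden by the Õ notation are ignored.

```python
def memory_bound(k: int, eps: float, ell: int) -> float:
    """Leading term k^3 / eps^(2/ell) of the memory bound for 2*ell passes.

    Assumption: constants and polylog factors hidden by the O-tilde
    notation are dropped, so the value is only meaningful for comparison.
    """
    if k < 1 or ell < 1 or not (0.0 < eps < 1.0):
        raise ValueError("need k >= 1, ell >= 1 and 0 < eps < 1")
    return k ** 3 / eps ** (2.0 / ell)


def tradeoff_table(k: int, eps: float, max_ell: int) -> list[tuple[int, float]]:
    """Return (number of passes, leading memory term) for ell = 1..max_ell."""
    return [(2 * ell, memory_bound(k, eps, ell)) for ell in range(1, max_ell + 1)]


if __name__ == "__main__":
    # Example parameters chosen purely for illustration.
    for passes, bits in tradeoff_table(k=5, eps=0.01, max_ell=4):
        print(f"{passes:2d} passes -> about {bits:,.0f} (up to hidden factors)")
```

With ε fixed, each increase in ℓ shrinks the exponent on 1/ε, which is the exponential tradeoff between passes and memory that the abstract refers to.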