The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Incremental clustering and dynamic information retrieval
STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
Random sampling for histogram construction: how much is enough?
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
On data structures and asymmetric communication complexity
Journal of Computer and System Sciences
External memory algorithms
Learning mixtures of arbitrary gaussians
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Fast, small-space algorithms for approximate histogram maintenance
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Pass efficient algorithms for approximating large matrices
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
An Information Statistics Approach to Data Stream and Communication Complexity
FOCS '02 Proceedings of the 43rd Symposium on Foundations of Computer Science
Optimal Histograms with Quality Guarantees
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Dynamic Maintenance of Wavelet-Based Histograms
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Better streaming algorithms for clustering problems
Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
Learning Mixtures of Gaussians
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Stable distributions, pseudorandom generators, embeddings and data stream computation
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Optimal approximations of the frequency moments of data streams
Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
Graph distances in the streaming model: the value of space
SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Estimating statistical aggregates on probabilistic data streams
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Declaring independence via the sketching of sketches
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Estimating statistical aggregates on probabilistic data streams
ACM Transactions on Database Systems (TODS)
Multiple Pass Streaming Algorithms for Learning Mixtures of Distributions in ${\mathbb R}^d$
ALT '07 Proceedings of the 18th international conference on Algorithmic Learning Theory
Multiple pass streaming algorithms for learning mixtures of distributions in Rd
Theoretical Computer Science
Optimal sampling from sliding windows
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Lower bounds for quantile estimation in random-order and multi-pass streaming
ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming
Hi-index | 0.00 |
We present multiple pass streaming algorithms for a basic clustering problem for massive data sets. If our algorithm is allotted 2l passes, it will produce an approximation with error at most ε using Õ(k3/ε2/l) bits of memory, the most critical resource for streaming computation. We demonstrate that this tradeoff between passes and memory allotted is intrinsic to the problem and model of computation by proving lower bounds on the memory requirements of any l pass randomized algorithm that are nearly matched by our upper bounds. To the best of our knowledge, this is the first time nearly matching bounds have been proved for such an exponential tradeoff for randomized computation.In this problem, we are given a set of n points drawn randomly according to a mixture of k uniform distributions and wish to approximate the density function of the mixture. The points are placed in a datastream (possibly in adversarial order), which may only be read sequentially by the algorithm. We argue that this models, among others, the datastream produced by a national census of the incomes of all citizens.