A common approach to dealing with large datasets is to stream over the input in one pass and perform computations using sublinear resources. For truly massive datasets, however, even a single pass over the data is prohibitive, so streaming computations must be distributed over many machines. In practice, obtaining significant speedups from distributed computation poses numerous challenges, including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems such as Google's MapReduce and Apache Hadoop address these problems by allowing only a certain class of highly distributable tasks, defined by local computations that can be applied to the input in any order. The fundamental question that arises is: how does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist? We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that, in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm with comparable space and communication complexity. Our simulation uses Savitch's theorem and therefore has superpolynomial time complexity. We extend the simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems, and indeterminate functions are impossible. Finally, we introduce an extension of the mud model to multiple keys and multiple rounds.
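To make the idea of "local computations that can be applied in any order" concrete, the following is a minimal Python sketch of a mud-style computation of a symmetric function (the mean and range of a list of integers). The three-part decomposition into a per-item map, a pairwise merge, and a final post-processing step, as well as the names `local_map`, `merge`, and `finalize`, are assumptions made for illustration; the abstract does not specify the model's formal definition. The sketch only shows that an associative, commutative merge gives the same answer under a left-to-right streaming fold and under arbitrary merge trees. It does not implement the simulation via Savitch's theorem mentioned above.

```python
import random
from functools import reduce

# Hypothetical names for illustration only; not taken from the source.

def local_map(x):
    """Map one input item to a small message: (count, sum, min, max)."""
    return (1, x, x, x)

def merge(a, b):
    """Combine two messages. The operation is associative and commutative,
    so the order and grouping of merges cannot affect the result."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

def finalize(msg):
    """Post-process the single remaining message into the final answer."""
    count, total, lo, hi = msg
    return {"mean": total / count, "range": hi - lo}

def run_sequential(items):
    """Streaming-style evaluation: fold messages left to right in one pass."""
    return finalize(reduce(merge, map(local_map, items)))

def run_arbitrary_tree(items, rng):
    """Mud-style evaluation: repeatedly merge two randomly chosen messages,
    which models an arbitrary input order and an arbitrary merge tree."""
    msgs = [local_map(x) for x in items]
    while len(msgs) > 1:
        i, j = rng.sample(range(len(msgs)), 2)
        merged = merge(msgs[i], msgs[j])
        msgs = [m for k, m in enumerate(msgs) if k not in (i, j)]
        msgs.append(merged)
    return finalize(msgs[0])

data = [3, 1, 4, 1, 5, 9, 2, 6]
rng = random.Random(0)  # fixed seed so the run is reproducible
sequential_result = run_sequential(data)
tree_results = [run_arbitrary_tree(data, rng) for _ in range(5)]
```

Here each message has constant size regardless of the input length, which is the sense in which such a computation uses sublinear space and communication. Integer inputs are used so that the sums are exact and the results can be compared for equality.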