On distributing symmetric streaming computations

Authors:
Jon Feldman;S. Muthukrishnan;Anastasios Sidiropoulos;Cliff Stein;Zoya Svitkina
Affiliations:
Google Inc., New York, NY;Google Inc., New York, NY;Toyota Technological Institute at Chicago, Chicago, IL;Columbia University, New York, NY;University of Alberta, Alberta, Canada
Venue:
ACM Transactions on Algorithms (TALG)
Year:
2010

Abstract

A common approach for dealing with large datasets is to stream over the input in one pass and perform computations using sublinear resources. For truly massive datasets, however, even a single pass over the data is prohibitive, so streaming computations must be distributed over many machines. In practice, obtaining significant speedups from distributed computation poses numerous challenges, including synchronization, load balancing, overcoming processor failures, and data distribution. Successful practical systems such as Google's MapReduce and Apache's Hadoop address these problems by allowing only a certain class of highly distributable tasks, defined by local computations that can be applied to the input in any order. The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist? We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that, in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm with comparable space and communication complexity. Our simulation uses Savitch's theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems, and indeterminate functions are impossible. Finally, we introduce an extension of the mud model to multiple keys and multiple rounds.
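To make the idea of "local computations that can be applied in any order" concrete, here is a minimal Python sketch. It assumes, for illustration only, that a mud-style computation splits into three parts: a per-item local step, a binary merge of two messages, and a final post-processing step. The names `run_mud`, `run_stream`, `local`, `merge`, and `finish` are hypothetical and are not taken from the paper. The sketch computes the mean, a symmetric function, and merges messages in a random order to mimic an arbitrary combining tree. It illustrates order-invariance only; it is not the paper's simulation of streaming algorithms.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A message is a small summary of part of the input: (sum, count).
Message = Tuple[int, int]


def local(x: int) -> Message:
    """Local step (hypothetical name): turn one input item into a message."""
    return (x, 1)


def merge(a: Message, b: Message) -> Message:
    """Merge step (hypothetical name): combine two messages into one."""
    return (a[0] + b[0], a[1] + b[1])


def finish(m: Message) -> float:
    """Post-processing step (hypothetical name): produce the final answer."""
    total, count = m
    return total / count


def run_mud(
    items: Sequence[int],
    local_fn: Callable[[int], Message],
    merge_fn: Callable[[Message, Message], Message],
    finish_fn: Callable[[Message], float],
    rng: random.Random,
) -> float:
    """Apply the local step to every item, then merge random pairs of
    messages until one remains. Assumes `items` is non-empty."""
    messages: List[Message] = [local_fn(x) for x in items]
    while len(messages) > 1:
        # Pick two distinct positions; pop the larger index first so the
        # smaller index stays valid.
        i, j = sorted(rng.sample(range(len(messages)), 2))
        b = messages.pop(j)
        a = messages.pop(i)
        messages.append(merge_fn(a, b))
    return finish_fn(messages[0])


def run_stream(items: Sequence[int]) -> float:
    """One-pass streaming baseline that keeps only a running sum and count."""
    total, count = 0, 0
    for x in items:
        total += x
        count += 1
    return total / count


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    for seed in range(5):
        result = run_mud(data, local, merge, finish, random.Random(seed))
        # Every merge order agrees with the one-pass streaming answer.
        assert result == run_stream(data)
    print(run_stream(data))
```

Because `merge` here is associative and commutative on exact integer pairs, every merge order yields the same final message, so the distributed result matches the one-pass result regardless of how the input is partitioned.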