On distributing symmetric streaming computations

Authors:
Jon Feldman;S. Muthukrishnan;Anastasios Sidiropoulos;Cliff Stein;Zoya Svitkina
Affiliations:
Google, Inc., New York, NY;Google, Inc., New York, NY;Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT, Cambridge, MA;Columbia University;Cornell University
Venue:
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2008

Abstract

A common approach for dealing with large data sets is to stream over the input in one pass and perform computations using sublinear resources. For truly massive data sets, however, even making a single pass over the data is prohibitive. Therefore, streaming computations must be distributed over many machines. In practice, obtaining significant speedups from distributed computation poses numerous challenges, including synchronization, load balancing, overcoming processor failures, and data distribution. Successful systems in practice, such as Google's MapReduce and Apache's Hadoop, address these problems by allowing only a certain class of highly distributable tasks, defined by local computations that can be applied in any order to the input. The fundamental question that arises is: How does the class of computational tasks supported by these systems differ from the class for which streaming solutions exist? We introduce a simple algorithmic model for massive, unordered, distributed (mud) computation, as implemented by these systems. We show that, in principle, mud algorithms are equivalent in power to symmetric streaming algorithms. More precisely, we show that any symmetric (order-invariant) function that can be computed by a streaming algorithm can also be computed by a mud algorithm, with comparable space and communication complexity. Our simulation uses Savitch's theorem and therefore has superpolynomial time complexity. We extend our simulation result to some natural classes of approximate and randomized streaming algorithms. We also give negative results, using communication complexity arguments to prove that extensions to private randomness, promise problems, and indeterminate functions are impossible. Finally, we introduce an extension of the mud model to multiple keys and multiple rounds.
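To make the idea of "local computations that can be applied in any order" concrete, here is a minimal sketch of one symmetric function (the range, max minus min) computed both by a one-pass streaming loop and in a mud-like style. The decomposition into a local map, a binary merge, and a post-processing step is an assumption made for illustration, as are the names phi, merge, eta, mud_evaluate, and streaming_range; none of them come from the abstract. The example does not implement the paper's general simulation, which relies on Savitch's theorem.

```python
import random

def phi(x):
    # Local map (hypothetical name): turn one input item into a message.
    # Here a message is a (min, max) pair summarizing the items seen so far.
    return (x, x)

def merge(a, b):
    # Binary aggregator (hypothetical name): combine two messages.
    # It is commutative and associative, so merge order cannot matter.
    return (min(a[0], b[0]), max(a[1], b[1]))

def eta(q):
    # Post-processing (hypothetical name): turn the final message into the answer.
    return q[1] - q[0]

def mud_evaluate(items, rng):
    # Map every item locally, then merge messages pairwise in an arbitrary
    # order chosen by rng, mimicking an unordered distributed aggregation.
    # Assumes items is non-empty.
    messages = [phi(x) for x in items]
    while len(messages) > 1:
        a = messages.pop(rng.randrange(len(messages)))
        b = messages.pop(rng.randrange(len(messages)))
        messages.append(merge(a, b))
    return eta(messages[0])

def streaming_range(items):
    # One-pass streaming computation of the same symmetric function,
    # keeping only two values of state. Assumes items is non-empty.
    lo = hi = None
    for x in items:
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
    return hi - lo

if __name__ == "__main__":
    data = [7, 3, 9, 1, 4]
    print(streaming_range(data))
    print(mud_evaluate(data, random.Random(0)))
```

Because merge is commutative and associative, every merge order yields the same final message, which is why the result agrees with the streaming pass regardless of how the data is partitioned.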