Upper and lower bounds on the cost of a map-reduce computation

Authors:
Anish Das Sarma;Foto N. Afrati;Semih Salihoglu;Jeffrey D. Ullman
Affiliations:
Google Research;National Technical University of Athens;Stanford University;Stanford University
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the more finely we partition the work of the reducers to extract parallelism, the greater the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of map-reduce computation. This model enables a generic recipe for discovering lower bounds on communication cost as a function of the maximum number of inputs that can be assigned to one reducer. We use the model to analyze the tradeoff for three problems: finding pairs of strings at Hamming distance d, finding triangles and other patterns in a larger graph, and matrix multiplication. For finding pairs of strings at Hamming distance 1, we have upper and lower bounds that match exactly. For triangles and many other graphs, we have upper and lower bounds that are the same to within a constant factor. For the problem of matrix multiplication, we have matching upper and lower bounds for one-round map-reduce algorithms. We are also able to explore two-round map-reduce algorithms for matrix multiplication and show that these never require more communication, for a given reducer size, than the best one-round algorithm, and often require significantly less.
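
The tradeoff described above can be made concrete with a small simulation of the Hamming-distance-1 problem. The abstract does not spell out an algorithm, so the following is only a sketch under stated assumptions: it uses a segment-splitting scheme chosen for illustration, and the function name `hamming1_pairs_by_splitting` and the parameters `b` (string length in bits) and `c` (number of segments) are hypothetical. Each string is sent to `c` reducers, one per segment, keyed by the string with that segment removed. Larger `c` means smaller reducers (more parallelism) but more copies of every input (more communication).

```python
from collections import defaultdict
from itertools import combinations, product


def hamming1_pairs_by_splitting(strings, b, c):
    """Simulate one map-reduce round for finding pairs at Hamming distance 1.

    Assumed scheme (illustrative only): split each b-bit string into c equal
    segments and emit it once per segment, keyed by the segment index and
    the string with that segment removed.
    """
    assert b % c == 0, "b must be divisible by c in this sketch"
    seg = b // c

    # Map phase: every string is replicated to c reducers.
    reducers = defaultdict(list)
    for s in strings:
        for i in range(c):
            key = (i, s[:i * seg] + s[(i + 1) * seg:])
            reducers[key].append(s)

    # Reduce phase: each reducer compares only the strings it received.
    pairs = set()
    for group in reducers.values():
        for u, v in combinations(group, 2):
            if sum(x != y for x, y in zip(u, v)) == 1:
                pairs.add((min(u, v), max(u, v)))

    # Communication per input, and the largest reducer input.
    replication_rate = sum(len(g) for g in reducers.values()) / len(strings)
    max_reducer_size = max(len(g) for g in reducers.values())
    return pairs, replication_rate, max_reducer_size


if __name__ == "__main__":
    b = 4
    all_strings = ["".join(bits) for bits in product("01", repeat=b)]
    for c in (1, 2, 4):
        pairs, r, q = hamming1_pairs_by_splitting(all_strings, b, c)
        print(f"c={c}: pairs={len(pairs)}, replication rate={r}, "
              f"max reducer size={q}")
```

On the full set of 4-bit strings, every setting of `c` finds the same pairs, because two strings that differ in exactly one bit agree outside the segment containing that bit and therefore meet at one reducer. What changes is the cost: as `c` grows from 1 to 4, the maximum reducer size shrinks while the replication rate rises.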