Designing good algorithms for MapReduce and beyond

Authors:
Foto N. Afrati; Magdalena Balazinska; Anish Das Sarma; Bill Howe; Semih Salihoglu; Jeffrey D. Ullman
Affiliations:
Google; University of Washington; Google; University of Washington; Stanford University; Stanford University
Venue:
Proceedings of the Third ACM Symposium on Cloud Computing
Year:
2012

Abstract

As MapReduce/Hadoop grows in importance, we find more exotic applications being written this way. Not every program written for this platform performs as well as we might wish. There are several reasons why a MapReduce program can underperform expectations. One is the need to balance the communication cost of transporting data from the mappers to the reducers against the computation done at the mappers and reducers themselves. A second important issue is selecting the number of rounds of MapReduce. A third issue is that of skew. If wall-clock time is important, then using many different reduce-keys and many compute nodes may minimize the time to finish the job. Yet if the data is uncooperative, and no provision is made to distribute the data evenly, much of the work is done by a single node.
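The skew issue can be illustrated with a small sketch. It is not taken from the paper: the function names `reducer_loads` and `skew_ratio`, the integer keys, and the use of `key % num_reducers` as a stand-in for a hash partitioner are all assumptions made for illustration. The sketch counts how many records each reducer would receive and compares the busiest reducer with the average.

```python
def reducer_loads(keys, num_reducers):
    """Count the records each reducer receives.

    Assumption: integer reduce-keys, with `key % num_reducers` standing in
    for the hash partitioner a real MapReduce system would use.
    """
    loads = [0] * num_reducers
    for key in keys:
        loads[key % num_reducers] += 1
    return loads


def skew_ratio(loads):
    """Ratio of the busiest reducer's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Cooperative data: 1000 distinct keys, one record each.
uniform_keys = list(range(1000))

# Uncooperative data: key 0 appears 900 times, keys 1..100 appear once each.
skewed_keys = [0] * 900 + list(range(1, 101))

print(reducer_loads(uniform_keys, 4))  # [250, 250, 250, 250]
print(reducer_loads(skewed_keys, 4))   # [925, 25, 25, 25]
print(skew_ratio(reducer_loads(skewed_keys, 4)))
```

Both inputs contain 1000 records and use four reducers, yet in the skewed case one reducer handles 925 of them. Adding more reducers would not help here, because every record with the heavy key still goes to the same reducer.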