Automatic optimization for MapReduce programs

Authors:
Eaman Jahani; Michael J. Cafarella; Christopher Ré
Affiliations:
University of Michigan, Ann Arbor, MI; University of Michigan, Ann Arbor, MI; University of Wisconsin, Madison, WI
Venue:
Proceedings of the VLDB Endowment
Year:
2011



Abstract

The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational database to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing systems do not apply them, largely because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection in a MapReduce program is embedded in Java code along with a great deal of other program logic. We could ask the programmer to provide explicit hints about the program's data semantics, but one of MapReduce's attractions is precisely that it does not ask the user for such information. This paper presents Manimal, which automatically analyzes MapReduce programs and applies appropriate data-aware optimizations, requiring no additional help from the programmer. We show that Manimal successfully detects optimization opportunities across a range of data operations, and that it yields speedups of up to 1,121% on previously written MapReduce programs.
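To make the "hidden selection" point concrete, here is a minimal sketch, written in Python for brevity although the programs the abstract discusses are Java. It is not Manimal's implementation. The record layout, the `THRESHOLD` constant, and the helper names (`map_fn`, `run_job`, `indexed_scan`) are all hypothetical, and the sorted list merely stands in for a B+Tree-style index. The sketch shows a map function whose selection is an ordinary `if` mixed with other logic, and how feeding it only the records that can pass the predicate gives the same result with fewer map invocations.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical records of the form (url, rank); not data from the paper.
RECORDS = [("a.com", 3), ("b.com", 12), ("c.com", 7), ("d.com", 25), ("e.com", 12)]
THRESHOLD = 10  # assumed selection constant buried in the user's map code


def map_fn(record):
    """User-written map: the selection is just an `if` among other logic."""
    url, rank = record
    domain = url.split(".")[-1]  # unrelated program logic
    if rank > THRESHOLD:         # the selection predicate, invisible to the framework
        yield domain, rank


def reduce_fn(key, values):
    """Sum all ranks emitted for a key."""
    return key, sum(values)


def run_job(records):
    """Tiny in-memory stand-in for a MapReduce run; also counts map invocations."""
    groups = defaultdict(list)
    map_calls = 0
    for record in records:
        map_calls += 1
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items()), map_calls


def indexed_scan(records, threshold):
    """Stand-in for an index on rank: sort once, then skip past non-matching records."""
    by_rank = sorted(records, key=lambda r: r[1])
    ranks = [r[1] for r in by_rank]
    return by_rank[bisect_right(ranks, threshold):]


# Baseline: every record is passed to map_fn.
full_result, full_calls = run_job(RECORDS)
# Assumed optimization: only records that can satisfy the predicate are passed to map_fn.
fast_result, fast_calls = run_job(indexed_scan(RECORDS, THRESHOLD))
```

In this toy setting both runs produce the same output, but the second invokes `map_fn` on three records instead of five. The hard part, which the sketch skips entirely, is discovering the predicate `rank > THRESHOLD` from the user's code without any hints.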