Towards automatic optimization of MapReduce programs

Authors: Shivnath Babu
Affiliations: Duke University, Durham, NC, USA
Venue: Proceedings of the 1st ACM symposium on Cloud computing
Year: 2010

Abstract

Timely and cost-effective processing of large datasets has become a critical ingredient for the success of many academic, government, and industrial organizations. The combination of MapReduce frameworks and cloud computing is an attractive proposition for these organizations. However, even to run a single program in a MapReduce framework, a number of tuning parameters have to be set by users or system administrators. Users often run into performance problems because they don't know how to set these parameters, or because they don't even know that these parameters exist. Because MapReduce is a relatively new technology, qualified administrators are hard to find. In this position paper, we make a case for techniques to automate the setting of tuning parameters for MapReduce programs. The objective is to provide good out-of-the-box performance for ad hoc MapReduce programs run on large datasets. This feature can go a long way towards improving the productivity of users who cannot optimize programs themselves because they are unfamiliar with MapReduce or with the data being processed.