The performance of MapReduce: an in-depth study

Authors:
Dawei Jiang;Beng Chin Ooi;Lei Shi;Sai Wu
Affiliations:
National University of Singapore;National University of Singapore;National University of Singapore;National University of Singapore
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Citing 14
Cited 39

Performance tradeoffs in read-optimized databases

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Clustera: an integrated computation and data management system

Proceedings of the VLDB Endowment
Read-optimized databases, in depth

Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
MapReduce: a flexible data processing tool

Communications of the ACM - Amir Pnueli: Ahead of His Time
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Towards automatic optimization of MapReduce programs

Proceedings of the 1st ACM symposium on Cloud computing
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Providing scalable database services on the cloud

WISE'10 Proceedings of the 11th international conference on Web information systems engineering
CoHadoop: flexible data placement and its exploitation in Hadoop

Proceedings of the VLDB Endowment
Query optimization for massively parallel data processing

Proceedings of the 2nd ACM Symposium on Cloud Computing
ActiveSLA: a profit-oriented admission control framework for database-as-a-service providers

Proceedings of the 2nd ACM Symposium on Cloud Computing
Hadoop acceleration through network levitated merge

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Building cubes with MapReduce

Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Analytics over large-scale multidimensional data: the big data revolution!

Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Building wavelet histograms on large data in MapReduce

Proceedings of the VLDB Endowment
Parallel data processing with MapReduce: a survey

ACM SIGMOD Record
PerfXplain: debugging MapReduce job performance

Proceedings of the VLDB Endowment
Apriori-based frequent itemset mining algorithms on MapReduce

Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
Optimizing analytic data flows for multiple execution engines

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Clydesdale: structured data processing on MapReduce

Proceedings of the 15th International Conference on Extending Database Technology
Adaptive MapReduce using situation-aware mappers

Proceedings of the 15th International Conference on Extending Database Technology
MapReduce Workload Modeling with Statistical Approach

Journal of Grid Computing
Efficient processing of k nearest neighbor joins using MapReduce

Proceedings of the VLDB Endowment
Efficient multi-way theta-join processing using MapReduce

Proceedings of the VLDB Endowment
Only aggressive elephants are fast elephants

Proceedings of the VLDB Endowment
PRISM: privacy-preserving search in mapreduce

PETS'12 Proceedings of the 12th international conference on Privacy Enhancing Technologies
Optimization of analytic data flows for next generation business intelligence applications

TPCTC'11 Proceedings of the Third TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization
Efficient big data processing in Hadoop MapReduce

Proceedings of the VLDB Endowment
Sailfish: a framework for large scale data processing

Proceedings of the Third ACM Symposium on Cloud Computing
Cloud MapReduce for Monte Carlo bootstrap applied to Metabolic Flux Analysis

Future Generation Computer Systems
Just-in-time data distribution for analytical query processing

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Optimizing and Tuning MapReduce Jobs to Improve the Large-Scale Data Analysis Process

International Journal of Intelligent Systems
Eagle-eyed elephant: split-oriented indexing in Hadoop

Proceedings of the 16th International Conference on Extending Database Technology
Evaluating MapReduce for profiling application traffic

Proceedings of the first edition workshop on High performance and programmable networking
Issues in big data testing and benchmarking

Proceedings of the Sixth International Workshop on Testing Database Systems
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
Database research at the National University of Singapore

ACM SIGMOD Record
Data warehousing and OLAP over big data: current challenges and future research directions

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
The family of mapreduce and large-scale data processing systems

ACM Computing Surveys (CSUR)
Big data: a research agenda

Proceedings of the 17th International Database Engineering & Applications Symposium
Gunther: search-based auto-tuning of mapreduce

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Piranha: optimizing short jobs in Hadoop

Proceedings of the VLDB Endowment
Hybrid Analytic Flows-the Case for Optimization

Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology

Quantified Score

Hi-index	0.01

Abstract

MapReduce has been widely used for large-scale data analysis in the cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance, although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems by a factor of 3.1 to 6.5 across a variety of analytical tasks. MapReduce can achieve better performance when more compute nodes are allocated from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users desire an economical, elastically scalable data processing system and are therefore interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop, and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], and is thus more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.