The Case for Evaluating MapReduce Performance Using Workload Suites

Authors:
Yanpei Chen;Archana Ganapathi;Rean Griffith;Randy Katz
Venue:
MASCOTS '11 Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems
Year:
2011

Citing 0
Cited 24

Verifiable resource accounting for cloud computing services

Proceedings of the 3rd ACM workshop on Cloud computing security workshop
Energy efficiency for large-scale MapReduce workloads with significant interactive analysis

Proceedings of the 7th ACM european conference on Computer Systems
Camdoop: exploiting in-network aggregation for big data applications

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Delay tails in MapReduce scheduling

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Understanding the effects and implications of compute node related failures in hadoop

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
An Analysis of Provisioning and Allocation Policies for Infrastructure-as-a-Service Clouds

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Mirror mirror on the ceiling: flexible wireless links for data centers

Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads

Proceedings of the VLDB Endowment
Mirror mirror on the ceiling: flexible wireless links for data centers

ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12
Heterogeneity and dynamicity of clouds at scale: Google trace analysis

Proceedings of the Third ACM Symposium on Cloud Computing
Scheduling mapreduce jobs in HPC clusters

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment

UCC '12 Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing
Towards verifiable resource accounting for outsourced computation

Proceedings of the 9th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Evaluating MapReduce for profiling application traffic

Proceedings of the first edition workshop on High performance and programmable networking
Rhea: automatic filtering for unstructured cloud storage

NSDI'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Utility-aware deferred load balancing in the cloud driven by dynamic pricing of electricity

Proceedings of the Conference on Design, Automation and Test in Europe
Natjam: design and evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters

Proceedings of the 4th annual Symposium on Cloud Computing
Limplock: understanding the impact of limpware on scale-out cloud systems

Proceedings of the 4th annual Symposium on Cloud Computing
MROrder: flexible job ordering optimization for online mapreduce workloads

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Proceedings of the VLDB Endowment
SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Journal of Parallel and Distributed Computing
Data Center Power Cost Optimization via Workload Modulation

UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing
MixApart: decoupled analytics for shared storage systems

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
SpringFS: bridging agility and performance in elastic distributed storage

FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies

Abstract

MapReduce systems face enormous challenges due to the increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture the rich workload characteristics observed in traces, and propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads give cluster operators new ways to identify workload-specific resource bottlenecks and to make workload-specific choices of MapReduce task schedulers. We expect that, once available, workload suites will allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.