An Analysis of Traces from a Production MapReduce Cluster

Authors:
Soila Kavulya;Jiaqi Tan;Rajeev Gandhi;Priya Narasimhan
Venue:
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Year:
2010

Abstract

MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both cloud service providers and users: the service provider can use this knowledge to make better scheduling decisions, while users can learn which aspects of their jobs impact performance. This paper analyzes 10 months of MapReduce logs from the M45 supercomputing cluster, which Yahoo! made freely available to select universities for academic research. We characterize resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.
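
The abstract does not specify how the instance-based predictor is built, so the following Python sketch is only one plausible reading, not the authors' method. It assumes a distance-weighted k-nearest-neighbour estimate restricted to jobs submitted within a recent time window (the "temporal locality" part). The `Job` fields, the log-scaled features, the Gaussian kernel, and the parameters `k`, `window`, and `bandwidth` are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are hypothetical; the abstract does not list the features used.
    submit_time: float      # seconds since an arbitrary epoch
    num_maps: int
    num_reduces: int
    input_bytes: float
    duration: float = 0.0   # observed completion time in seconds (unused for queries)


def features(job):
    """Log-scale the raw counts so that very large jobs do not dominate distances."""
    return (
        math.log1p(job.num_maps),
        math.log1p(job.num_reduces),
        math.log1p(job.input_bytes),
    )


def predict_duration(history, query, k=3, window=7 * 24 * 3600.0, bandwidth=1.0):
    """Estimate the query job's duration from similar, recently submitted jobs.

    Returns None when no historical job falls inside the time window.
    """
    # Temporal locality (assumed form): only consider jobs submitted
    # at most `window` seconds before the query job.
    recent = [j for j in history
              if 0 <= query.submit_time - j.submit_time <= window]
    if not recent:
        return None

    q = features(query)
    # Keep the k nearest neighbours by Euclidean distance in feature space.
    scored = sorted(((math.dist(q, features(j)), j) for j in recent),
                    key=lambda pair: pair[0])[:k]

    # Gaussian kernel: closer neighbours get larger weights.
    weights = [math.exp(-(d / bandwidth) ** 2) for d, _ in scored]
    total = sum(weights)
    if total == 0.0:
        # All neighbours are too far for the kernel; fall back to a plain mean.
        return sum(j.duration for _, j in scored) / len(scored)
    return sum(w * j.duration for w, (_, j) in zip(weights, scored)) / total
```

Under these assumptions, a prediction that differs sharply from a job's observed duration could serve as a flag for a potential performance problem, which is the use the abstract describes.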