MPI has been the de facto standard for parallel programming for decades. In recent years, concern about the reliability of MPI applications has grown, due in part to the inefficiency of parallel checkpointing. MapReduce is a newer programming model originally introduced for massive data processing. Numerous recent efforts have transformed classical MPI-based scientific applications into MapReduce applications, drawn by MapReduce's ease of programming, automatic parallelism, and built-in fault tolerance. However, the stricter synchronization primitive that MapReduce enforces also imposes considerable overhead. While the failure-free performance of MPI and MapReduce has been compared, little work exists that compares the two programming models under failures. In this paper, we propose an analytical approach to quantifying and comparing the fault-tolerance capabilities of the two programming models. We also carry out extensive numerical analysis to study the impact of different parameters on fault tolerance. The HPC community can use this work to inform critical decisions; for example, it helps algorithm designers answer questions such as: at what scale should we abandon MPI in favor of MapReduce to obtain better performance in the presence of failures?
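As context for the checkpointing-overhead concern raised above, the classic result in this area is Young's first-order approximation of the optimum checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the system's mean time between failures. The sketch below is illustrative only, not the analytical model of this paper; the numeric inputs (5-minute checkpoint cost, 5-year per-node MTBF, 10,000 nodes) are hypothetical, and the assumption that system MTBF scales as per-node MTBF divided by node count is a common simplification.

```python
import math

def young_interval(checkpoint_cost_s: float, system_mtbf_s: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

# Hypothetical numbers: a 5-minute checkpoint cost and a 5-year per-node
# MTBF. With 10,000 nodes, the system-wide MTBF shrinks to roughly 4.4 hours
# (assuming independent, identically distributed node failures).
delta = 300.0                                    # checkpoint write time, seconds
system_mtbf = 5 * 365 * 24 * 3600 / 10_000       # ~15,768 seconds

tau = young_interval(delta, system_mtbf)
print(f"checkpoint every ~{tau:.0f} s ({tau / 60:.1f} min)")
```

The point of the illustration is how quickly the optimum interval collapses as node count grows: at this scale the application would need to checkpoint roughly every 51 minutes, and the interval shrinks further with the square root of the failure rate, which is precisely the scaling pressure that motivates comparing MPI's checkpoint/restart against MapReduce's task-level re-execution.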