MPI has been the de facto standard for parallel programming for decades. In recent years, concern about the reliability of MPI applications has grown, due in part to the inefficiency of parallel checkpointing. MapReduce is a newer programming model originally introduced for massive data processing. Numerous recent efforts have transformed classical MPI-based scientific applications into MapReduce applications, drawn by MapReduce's ease of programming, automatic parallelism, and built-in fault tolerance. However, the stricter synchronization primitive that MapReduce enforces also imposes considerable overhead. While the failure-free performance of MPI and MapReduce has been compared, little work exists that compares the two programming models under failures. In this paper, we propose an analytical approach to quantifying and comparing the fault-tolerance capabilities of the two programming models. We also carry out extensive numerical analysis to study the impact of different parameters on fault tolerance. The HPC community can use this work to inform critical decisions; for example, it helps algorithm designers answer questions such as: at what scale should we abandon MPI in favor of MapReduce to obtain better performance in the presence of failures?
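As context for the checkpointing-overhead concern raised above, the classic result in this area is Young's first-order approximation of the optimum checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the system's mean time between failures. The sketch below is illustrative only, not the analytical model of this paper; the numeric inputs (5-minute checkpoint cost, 5-year per-node MTBF, 10,000 nodes) are hypothetical, and the assumption that system MTBF scales as per-node MTBF divided by node count is a common simplification.

```python
import math

def young_interval(checkpoint_cost_s: float, system_mtbf_s: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

# Hypothetical numbers: a 5-minute checkpoint cost and a 5-year per-node
# MTBF. With 10,000 nodes, the system-wide MTBF shrinks to roughly 4.4 hours
# (assuming independent, identically distributed node failures).
delta = 300.0                                    # checkpoint write time, seconds
system_mtbf = 5 * 365 * 24 * 3600 / 10_000       # ~15,768 seconds

tau = young_interval(delta, system_mtbf)
print(f"checkpoint every ~{tau:.0f} s ({tau / 60:.1f} min)")
```

The point of the illustration is how quickly the optimum interval collapses as node count grows: at this scale the application would need to checkpoint roughly every 51 minutes, and the interval shrinks further with the square root of the failure rate, which is precisely the scaling pressure that motivates comparing MPI's checkpoint/restart against MapReduce's task-level re-execution.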