Automated application-level checkpointing of MPI programs

Authors:
Greg Bronevetsky; Daniel Marques; Keshav Pingali; Paul Stodghill
Affiliations:
Cornell University, Ithaca, NY (all authors)
Venue:
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2003


Abstract

The running times of many computational science applications, such as protein folding using ab initio methods, are much longer than the mean time to failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures. In this paper, we focus on the stopping failure model, in which a faulty process hangs and stops responding to the rest of the system. We argue that such faults are best tolerated by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We then present a suitable protocol, which is implemented by a coordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results suggesting that the overhead of using our system can be small.
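
To make the idea of a coordination layer between the application and the message-passing library more concrete, the following is a minimal sketch in Python rather than C/MPI. It is an illustration under stated assumptions, not the protocol from the paper: the abstract does not give protocol details, so the epoch counter, the piggybacked epoch on each message, the three message categories, and every name (`Process`, `late_log`, `early_ids`) are hypothetical. The sketch only shows how a wrapper around send and receive could let each process checkpoint without blocking its peers while still noticing messages that cross a checkpoint.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy stand-in for one rank; every name here is hypothetical."""
    rank: int
    epoch: int = 0  # number of local checkpoints taken so far
    late_log: list = field(default_factory=list)   # payloads kept for replay
    early_ids: list = field(default_factory=list)  # (sender, epoch) of early messages
    inbox: deque = field(default_factory=deque)    # stands in for the network

    def checkpoint(self):
        """Save local state (elided here) and start a new epoch.

        No peer is asked to stop, which is the non-blocking part.
        """
        self.epoch += 1

    def send(self, dest, payload):
        """Wrap the real send: piggyback the sender's epoch on the message."""
        dest.inbox.append((self.rank, self.epoch, payload))

    def recv(self):
        """Wrap the real receive: classify the message, then pass it up."""
        sender, msg_epoch, payload = self.inbox.popleft()
        if msg_epoch < self.epoch:
            # Sent before the sender's checkpoint, received after ours:
            # keep a copy so it could be replayed after a restart.
            kind = "late"
            self.late_log.append(payload)
        elif msg_epoch > self.epoch:
            # Sent after the sender's checkpoint, received before ours:
            # remember it so a restart would not deliver it twice.
            kind = "early"
            self.early_ids.append((sender, msg_epoch))
        else:
            kind = "intra-epoch"
        return kind, payload
```

In this sketch the application would call only `send`, `recv`, and `checkpoint`, so the underlying transport could be swapped without changing application code, which mirrors the abstract's point about independence from the MPI implementation.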