Collective operations in application-level fault-tolerant MPI

  • Authors:
  • Greg Bronevetsky; Daniel Marques; Keshav Pingali; Paul Stodghill

  • Affiliations:
  • Cornell University, Ithaca, NY (all four authors)

  • Venue:
  • ICS '03: Proceedings of the 17th Annual International Conference on Supercomputing
  • Year:
  • 2003

Abstract

Fault-tolerance is becoming a critical issue on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after a failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually incurs too much overhead. In practice, programmers perform manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs without global barriers.

In an earlier paper, we presented a distributed checkpoint coordination protocol that handles MPI's point-to-point constructs while dealing with the unique challenges of application-level checkpointing. The protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it requires no modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper, we extend the protocol to handle MPI's collective communication constructs. We also present experimental results showing that the overhead introduced by the protocol for collective operations is small.
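
To illustrate the interposition idea described in the abstract (though not the paper's actual protocol logic), the thin layer can intercept MPI calls through the standard PMPI profiling interface, perform its protocol bookkeeping, and forward the call to the unmodified MPI library. The sketch below assumes a modern MPI header; the helper names (ft_checkpoint_due, ft_take_checkpoint) and the collective counter are hypothetical placeholders, not functions from the paper.

    /* Minimal sketch of interposing on one MPI collective via the PMPI
     * profiling interface; the real protocol's coordination logic is
     * far more involved and is not reproduced here. */
    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical protocol state: the real protocol would track
     * per-communicator epochs and logged collectives. */
    static int ft_collectives_seen = 0;

    static int ft_checkpoint_due(void)
    {
        /* Placeholder policy: the paper's protocol checkpoints at
         * programmer-chosen locations, coordinated across processes. */
        return 0;
    }

    static void ft_take_checkpoint(MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        fprintf(stderr, "[rank %d] application-level checkpoint\n", rank);
    }

    /* The application calls MPI_Allreduce as usual; this wrapper does
     * its bookkeeping and forwards to the real library via
     * PMPI_Allreduce, so the MPI library itself stays unmodified. */
    int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
    {
        if (ft_checkpoint_due())
            ft_take_checkpoint(comm);

        ft_collectives_seen++;  /* protocol bookkeeping would go here */
        return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    }

Linking such a wrapper library ahead of the MPI library is enough to interpose on the collective without recompiling MPI, which is the property the abstract emphasizes.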