Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing. This approach has been extended to message-passing systems through user-transparent process checkpointing and message logging. Many forms of rollback and recovery have also been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot use application-level information.

This paper describes the design and implementation of MPI/FT, a high-performance MPI-1.2 implementation enhanced with low-overhead functionality to detect and recover from process failures. The strategy behind MPI/FT is that fault tolerance in message-passing middleware can be optimized based on an application's execution model, which is derived from its communication topology and parallel programming semantics. MPI/FT exploits the specific characteristics of two parallel application execution models to optimize performance. MPI/FT also introduces a self-checking thread that monitors the functioning of the middleware itself. User-aware checkpointing and user-assisted recovery are compatible with MPI/FT and complement the techniques used here.

This paper offers a classification of MPI applications for the purposes of fault-tolerant MPI, and the MPI/FT implementation discussed here provides different middleware versions, each tailored to one of the two models studied in detail. The interplay of the parameters that affect the cost of fault tolerance is investigated. Experimental results demonstrate that the approach used to design and implement MPI/FT yields a low-overhead, MPI-based, fault-tolerant communication middleware implementation.
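
The abstract does not say how the self-checking thread works internally. As a rough illustration of the general idea only, the sketch below shows a heartbeat-style watchdog in Python: monitored components record a timestamp periodically, and a checker reports any component whose last heartbeat is older than a timeout. The class name `SelfCheckingMonitor`, its methods, and the timeout policy are all hypothetical and are not taken from MPI/FT.

```python
import threading
import time


class SelfCheckingMonitor:
    """Hypothetical watchdog (not MPI/FT's design): suspects any component
    whose last heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock            # injectable so the logic can be tested
        self._last_beat = {}          # component id -> time of last heartbeat
        self._lock = threading.Lock()

    def heartbeat(self, component_id):
        """Called by a monitored component to signal that it is alive."""
        with self._lock:
            self._last_beat[component_id] = self.clock()

    def suspected_failures(self):
        """Return the sorted ids of components whose heartbeat has expired."""
        now = self.clock()
        with self._lock:
            return sorted(
                cid for cid, t in self._last_beat.items()
                if now - t > self.timeout
            )

    def run(self, interval, on_failure, stop_event):
        """Checker loop: poll every `interval` seconds until stop_event is set."""
        while not stop_event.wait(interval):
            failed = self.suspected_failures()
            if failed:
                on_failure(failed)
```

In use, `run` would be started in a `threading.Thread` alongside the monitored components, with `on_failure` triggering whatever recovery action the middleware defines. A real implementation would also have to handle a failure of the monitor itself, which this sketch does not address.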