Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Fault tolerance has gained renewed importance with the proliferation of high-performance clusters. However, fault-tolerant systems have not yet been widely adopted commercially because they are hard to deploy, use, manage, maintain, or justify. To meet the demand for a fault-tolerant system, we have developed M^3, a practical and easily deployable multiple fault-tolerant MPI system for Myrinet. In this paper, we run rigorous tests using real-world applications to validate that M^3 can be used in commercial clusters. We also describe improvements made to our system to solve various problems that arose when deploying it on a commercial cluster. This paper models our system's checkpoint overhead and presents the results of a series of tests using computation- and communication-intensive MPI applications used commercially in various fields of science. The experimental results show that our system not only adapts well to various running environments but can also be practically deployed in commercial clusters.