MPI: a message passing interface
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Optimizing memory system performance for communication in parallel computers
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
OPIOM: off-processor I/O with myrinet
Future Generation Computer Systems - Best papers from symp. on cluster computing and the grid (CCGRID 2001)
A low-overhead recovery technique using quasi-synchronous checkpointing
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Efficient asynchronous memory copy operations on multi-core systems and I/OAT
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
Retrospect: deterministic replay of MPI applications for interactive distributed debugging
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Correlated set coordination in fault tolerant message logging protocols
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Alleviating scalability issues of checkpointing protocols
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
With the number of computing elements spiraling to hundred of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault tolerant; most are in need for a seamless recovery framework. Among the automatic fault tolerant techniques proposed for MPI, message logging is preferable for its scalable recovery. The major challenge for message logging protocols is the performance penalty on communications during failure-free periods, mostly coming from the payload copy introduced for each message. In this paper, we investigate different approaches for logging payload and compare their impact on network performance.