In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

Authors:
Gang Wang;Xiaoguang Liu;Ang Li;Fan Zhang
Affiliations:
Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, Tianjin, China 300071;Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, Tianjin, China 300071;Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, Tianjin, China 300071;Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, Tianjin, China 300071
Venue:
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Year:
2009



Abstract

High-performance computing (HPC) systems are now far larger than ever before, which makes fault tolerance a growing challenge. MPI (Message Passing Interface) is one of the most important programming tools for HPC, and several fault-tolerant extensions of MPI exist, such as MPICH-V, Starfish, and FT-MPI. Most of them rely on on-disk checkpointing. In this paper, we apply two kinds of XOR-based double-erasure codes, RDP (Row-Diagonal Parity) and B-Code, to in-memory checkpointing for MPI programs. We develop scalable checkpointing and recovery algorithms that embed the erasure-code encoding and decoding computation into MPI collective communication operations. Experiments show that the scalable algorithms reduce communication overhead and balance the computation effectively. Our approach provides highly reliable, fast in-memory checkpointing for MPI programs.
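
To make the idea of an XOR-based double-erasure code concrete, here is a minimal serial sketch of RDP encoding and recovery in Python. It is not the paper's algorithm: it runs in a single process rather than embedding the XORs in MPI collective operations, and it does not cover B-Code. It assumes the usual RDP layout for a prime P, with P-1 data columns (standing in for the checkpoint buffers of P-1 processes), one row-parity column, and one diagonal-parity column. The names `rdp_encode` and `rdp_recover`, the choice P = 5, and the one-integer "blocks" are all hypothetical and chosen only for illustration.

```python
from functools import reduce
from operator import xor

P = 5  # assumed prime; RDP protects P-1 data columns of P-1 rows each


def rdp_encode(data):
    """data[c][r] is block r of data column c.
    Returns P+1 columns: the data, then row parity, then diagonal parity."""
    rows = P - 1
    row_par = [reduce(xor, (data[c][r] for c in range(P - 1))) for r in range(rows)]
    cols = [list(col) for col in data] + [row_par]
    diag = [0] * rows
    for c in range(P):              # data columns plus the row-parity column
        for r in range(rows):
            d = (r + c) % P         # diagonal that cell (c, r) belongs to
            if d != P - 1:          # diagonal P-1 is deliberately not stored
                diag[d] ^= cols[c][r]
    return cols + [diag]


def rdp_recover(cols, lost):
    """Rebuild up to two lost columns in place by repeatedly solving any
    parity equation that has exactly one unknown cell."""
    rows = P - 1
    unknown = {(c, r) for c in lost for r in range(rows)}
    # Each equation is a list of cells whose XOR is zero.
    eqs = [[(c, r) for c in range(P)] for r in range(rows)]          # row equations
    for d in range(rows):                                            # diagonal equations
        cells = [(c, (d - c) % P) for c in range(P) if (d - c) % P != P - 1]
        eqs.append(cells + [(P, d)])
    while unknown:
        progress = False
        for eq in eqs:
            missing = [cell for cell in eq if cell in unknown]
            if len(missing) == 1:
                c, r = missing[0]
                known = (cols[x][y] for x, y in eq if (x, y) != (c, r))
                cols[c][r] = reduce(xor, known, 0)
                unknown.discard((c, r))
                progress = True
        if not progress:
            raise ValueError("more columns lost than the code can repair")
    return cols


# Example: four "processes", each holding four one-byte checkpoint blocks.
data = [[(17 * c + 31 * r + 5) % 256 for r in range(P - 1)] for c in range(P - 1)]
encoded = rdp_encode(data)

# Simulate losing data columns 0 and 1, then rebuild them.
damaged = [list(col) for col in encoded]
damaged[0] = [0] * (P - 1)
damaged[1] = [0] * (P - 1)
repaired = rdp_recover(damaged, lost=(0, 1))
```

Every step in both functions is a plain XOR, which is what makes it possible to fold such a code into reduction-style collective operations. How the paper actually distributes that work across processes is not shown here.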