Fault tolerant high performance computing by a coding approach

Authors:
Zizhong Chen;Graham E. Fagg;Edgar Gabriel;Julien Langou;Thara Angskun;George Bosilca;Jack Dongarra
Affiliations:
University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN
Venue:
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2005

Abstract

As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and therefore, whenever a node fails, must abort and restart either from the beginning or from a stable-storage-based checkpoint. This paper explores the use of a floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been proposed in theory for diskless checkpointing, few actual implementations exist. This probably stems from concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce a simple but efficient floating-point arithmetic coding approach into diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
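
The idea of a floating-point weighted-checksum code for diskless checkpointing can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (vandermonde_weights, encode, solve, recover) are hypothetical, plain Python lists stand in for the per-node checkpoint data, and the small Vandermonde-style weight matrix is an assumption chosen only to keep the example easy to check by hand. Each of m checksum vectors is a weighted sum of the k data vectors, and up to m lost vectors are rebuilt by solving a small linear system in ordinary floating-point arithmetic, which is where round-off error enters.

```python
def vandermonde_weights(m, k):
    """m x k weights with entry (j, i) = (i + 1) ** j (an assumption of this sketch)."""
    return [[float(i + 1) ** j for i in range(k)] for j in range(m)]


def encode(data, weights):
    """Checksum j is the elementwise sum over i of weights[j][i] * data[i]."""
    n = len(data[0])
    return [[sum(w[i] * data[i][t] for i in range(len(data))) for t in range(n)]
            for w in weights]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is f x f; b is f x n (one right-hand side per vector element).
    """
    f = len(a)
    a = [row[:] for row in a]
    b = [row[:] for row in b]
    for c in range(f):
        # Pick the largest pivot in column c to limit round-off growth.
        p = max(range(c, f), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, f):
            factor = a[r][c] / a[c][c]
            a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
            b[r] = [x - factor * y for x, y in zip(b[r], b[c])]
    x = [None] * f
    for r in reversed(range(f)):
        acc = b[r][:]
        for c in range(r + 1, f):
            acc = [v - a[r][c] * xv for v, xv in zip(acc, x[c])]
        x[r] = [v / a[r][r] for v in acc]
    return x


def recover(data, checksums, weights, failed):
    """Rebuild the vectors at indices `failed` (at most len(checksums) of them).

    Entries of `data` at failed indices are never read, so they may be None.
    """
    f = len(failed)
    n = len(checksums[0])
    alive = [i for i in range(len(data)) if i not in failed]
    # Coefficients of the unknown (lost) vectors in the first f checksums.
    a = [[weights[j][i] for i in failed] for j in range(f)]
    # Subtract the surviving vectors' contributions from those checksums.
    b = [[checksums[j][t] - sum(weights[j][i] * data[i][t] for i in alive)
          for t in range(n)] for j in range(f)]
    return dict(zip(failed, solve(a, b)))


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
    weights = vandermonde_weights(2, 4)
    checksums = encode(data, weights)
    lost = [1, 3]  # simulate two simultaneous node failures
    survivors = [None if i in lost else v for i, v in enumerate(data)]
    print(recover(survivors, checksums, weights, lost))
```

In a parallel setting the weighted sums would be computed by a reduction across nodes rather than a local loop, and the conditioning of the weight submatrix governs how much round-off error the recovery introduces; Vandermonde-style weights become ill-conditioned as the number of tolerated failures grows, so a production code would likely choose the weights more carefully.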