Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

Authors:
Zizhong Chen;Jack Dongarra
Affiliations:
The University of Tennessee, Knoxville, Department of Computer Science, Knoxville, TN;The University of Tennessee, Knoxville, Department of Computer Science, Knoxville, TN and Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Abstract

As the size of today's high performance computers increases from hundreds to thousands and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance methods for matrix computations are often derived using algorithms that are seldom used in the practice of today's high performance matrix computations, and they have mostly focused on platforms where failed processors produce incorrect calculations. To fill this gap, this paper extends the existing algorithm-based fault tolerance to the volatile computing platform, where the failed processor stops working, and applies it to scalable high performance matrix computations with two-dimensional block cyclic data distribution. We show the practicality of this technique by applying it to the ScaLAPACK/PBLAS matrix-matrix multiplication kernel. Experimental results demonstrate that the proposed approach is able to survive process failures with a very low performance overhead.
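The checksum idea behind algorithm-based fault tolerance for matrix multiplication can be illustrated with a small serial sketch. It is not the paper's implementation: it assumes the classic checksum encoding (a column-checksum row appended to A, a row-checksum column appended to B), models a stopped process as one lost row of the product, and ignores the two-dimensional block cyclic distribution and ScaLAPACK/PBLAS entirely. All function names are hypothetical.

```python
# Illustrative sketch only: a serial, pure-Python model of checksum-encoded
# matrix multiplication with recovery of lost data. Names are hypothetical.

def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_checksum_row(a):
    """Return a copy of a with an extra row holding its column sums."""
    return a + [[sum(col) for col in zip(*a)]]

def add_checksum_col(b):
    """Return a copy of b where each row is extended by its own sum."""
    return [row + [sum(row)] for row in b]

def recover_row(c_full, lost):
    """Rebuild one lost data row as: checksum row minus the surviving rows."""
    n_data = len(c_full) - 1          # last row is the checksum row
    checksum = c_full[-1]
    survivors = [c_full[i] for i in range(n_data) if i != lost]
    return [checksum[j] - sum(r[j] for r in survivors)
            for j in range(len(checksum))]

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]

# Multiplying the encoded inputs yields a product that carries its own
# checksums: the last row sums the data rows, the last column sums each row.
C_full = matmul(add_checksum_row(A), add_checksum_col(B))

# Model a fail-stop failure: the data held for row 1 is gone, not corrupted.
lost = 1
C_full[lost] = None

# Recover without a checkpoint, using only surviving rows and the checksum row.
recovered = recover_row(C_full, lost)
C_full[lost] = recovered
print(recovered)
```

Because the checksum relationship holds for the product itself, the lost row is rebuilt from surviving data alone, with no checkpoint taken. A single checksum row tolerates one lost row in this toy model; the sketch does not show how the checksums are kept consistent during a distributed computation.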