In today's high-performance computing practice, fail-stop failures are typically tolerated by checkpointing. While checkpointing is a general technique applicable to a wide range of applications, it often introduces considerable overhead, especially as computations reach petascale and beyond. In this paper, we show that, for many iterative methods, if the parallel data partitioning scheme satisfies certain conditions, the iterative methods themselves maintain enough inherent redundant information to recover the lost data accurately without checkpointing. We analyze the block-row data partitioning scheme for sparse matrices and derive a sufficient condition for recovering the critical data without checkpointing. When this sufficient condition is satisfied, neither checkpointing nor rollback is necessary for recovery. Furthermore, the fault-tolerance time overhead is zero if no failure occurs during a program execution; overhead is incurred only when a failure actually does occur. Experimental results demonstrate that, when applicable, the proposed scheme introduces much less overhead than checkpointing on Kraken, currently the world's eighth-fastest supercomputer.
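The recovery idea can be sketched with a small dense example. This is a hedged illustration, not the paper's implementation: the matrix, block sizes, and the use of the relation r = A x in place of the method's actual residual recurrence are all assumptions made for brevity. The point is that when a process fails, the surviving processes' rows of r = A x still constrain the lost block of the iterate, which can therefore be reconstructed by solving a small least-squares problem whenever the corresponding off-diagonal column block has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, nprocs = 12, 4              # toy sizes; the paper targets large sparse systems
blk = n // nprocs              # block-row size per process

# Dense stand-in for the iteration matrix (made diagonally dominant so it
# is well conditioned); a real run would use a distributed sparse matrix.
A = rng.random((n, n)) + n * np.eye(n)
x = rng.random(n)              # current iterate, distributed block-row-wise
r = A @ x                      # each process holds its own block of r = A x

# Process 1 fails: its blocks of x and r are lost.
lost = 1
lost_idx = np.arange(lost * blk, (lost + 1) * blk)
surv_idx = np.setdiff1d(np.arange(n), lost_idx)

# The surviving rows of r = A x give
#   r_surv = A[surv, surv] x_surv + A[surv, lost] x_lost,
# so x_lost is recoverable by least squares whenever A[surv, lost] has
# full column rank -- a dense analogue of the kind of rank condition the
# paper's sufficient condition places on the block-row partitioning.
rhs = r[surv_idx] - A[np.ix_(surv_idx, surv_idx)] @ x[surv_idx]
x_rec, *_ = np.linalg.lstsq(A[np.ix_(surv_idx, lost_idx)], rhs, rcond=None)

recovered_ok = bool(np.allclose(x_rec, x[lost_idx]))
```

Note that no checkpoint of x is ever taken: the redundancy lives entirely in the data the surviving processes already hold, which is why the overhead is zero in failure-free runs.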