Optimal real number codes for fault tolerant matrix operations

Authors: Zizhong Chen
Affiliations: Colorado School of Mines, Golden, CO
Venue: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year: 2009

Abstract

It has been demonstrated recently that a single fail-stop process failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing. Multiple simultaneous processor failures can also be tolerated without checkpointing by encoding matrices with a real-number erasure-correcting code. However, the floating-point representation of real numbers in today's high-performance computer architectures introduces round-off errors, which can be amplified during recovery and may cause the loss of all significant digits when the number of processors in the system is large. In this paper, we present a class of Reed-Solomon style real-number erasure-correcting codes that have optimal numerical stability during recovery. We analytically construct the numerically best erasure-correcting codes for 2 erasures and develop an approximation method to computationally construct numerically good codes for 3 or more erasures. Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.
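
To make the idea of a real-number erasure-correcting code concrete, the following is a minimal sketch in Python. It is not the code construction proposed in the paper: the weight rows (all ones, then 1, 2, ..., n) are a simple Vandermonde-style choice assumed here for illustration, and the function names `encode` and `recover_two` are hypothetical. The sketch encodes a vector with two weighted checksums, erases two entries, and recovers them by solving the resulting 2x2 linear system.

```python
def encode(data, weights):
    """Return one checksum per weight row: c_k = sum_j weights[k][j] * data[j]."""
    return [sum(w * x for w, x in zip(row, data)) for row in weights]


def recover_two(data, lost, weights, checksums):
    """Recover the two erased entries of `data` at positions `lost`.

    `data` may hold None at the erased positions; they are never read.
    `weights` must have two rows (an assumption of this sketch).
    """
    i, j = lost
    # Remove the contribution of the surviving entries from each checksum.
    rhs = []
    for row, c in zip(weights, checksums):
        known = sum(w * x for k, (w, x) in enumerate(zip(row, data))
                    if k not in lost)
        rhs.append(c - known)
    # Solve [[a, b], [c, d]] @ [x_i, x_j] = rhs with Cramer's rule.
    a, b = weights[0][i], weights[0][j]
    c, d = weights[1][i], weights[1][j]
    det = a * d - b * c
    x_i = (rhs[0] * d - b * rhs[1]) / det
    x_j = (a * rhs[1] - rhs[0] * c) / det
    return x_i, x_j


if __name__ == "__main__":
    data = [3.0, 1.5, -2.0, 4.25]
    # Illustrative Vandermonde-style weights, not the paper's optimal codes.
    weights = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0]]
    checksums = encode(data, weights)

    damaged = [3.0, None, -2.0, None]  # entries 1 and 3 are erased
    print(recover_two(damaged, (1, 3), weights, checksums))
```

With these weights, the 2x2 system for two adjacent erased positions far down the vector has large entries but a determinant of only 1, so it becomes increasingly ill-conditioned and round-off in the checksums is amplified during recovery. This is the kind of numerical instability the abstract describes for large processor counts.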