It has been demonstrated recently that a single fail-stop process failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing. Multiple simultaneous processor failures can be tolerated without checkpointing by encoding matrices with a real-number erasure-correcting code. However, the floating-point representation of real numbers in today's high-performance computer architectures introduces round-off errors, which can be amplified during recovery and, when the number of processors in the system is large, may cause the loss of all significant digits. In this paper, we present a class of Reed-Solomon-style real-number erasure-correcting codes that have optimal numerical stability during recovery. We analytically construct the numerically best erasure-correcting codes for 2 erasures and develop an approximation method to computationally construct numerically good codes for 3 or more erasures. Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.
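The role of numerical stability in recovery can be illustrated with a minimal sketch, under assumptions that are not taken from the paper. Here n data values are protected by two weighted checksums, and two lost values are recovered by solving a 2x2 linear system whose condition number bounds how much round-off error can be amplified. The two generator matrices below (`vandermonde_code` and `circle_code`) and all function names are hypothetical choices made for illustration only. They are not the codes constructed in this paper.

```python
import math

def vandermonde_code(n):
    """Two checksum rows: all ones and 1..n (a Vandermonde-style choice, assumed here)."""
    return [[1.0] * n, [float(i + 1) for i in range(n)]]

def circle_code(n):
    """Two checksum rows from n equally spaced angles in [0, pi).
    Purely illustrative; NOT the construction proposed in the paper."""
    angles = [math.pi * i / n for i in range(n)]
    return [[math.cos(t) for t in angles], [math.sin(t) for t in angles]]

def encode(code, data):
    """Compute one weighted checksum per row of the generator matrix."""
    return [sum(w * x for w, x in zip(row, data)) for row in code]

def recover_two(code, data, checksums, p, q):
    """Recover data[p] and data[q] from the surviving values and both checksums.
    The entries data[p] and data[q] are treated as lost and never read."""
    rhs = []
    for row, checksum in zip(code, checksums):
        known = sum(row[i] * data[i] for i in range(len(data)) if i not in (p, q))
        rhs.append(checksum - known)
    a, b = code[0][p], code[0][q]
    c, d = code[1][p], code[1][q]
    det = a * d - b * c
    # Cramer's rule on [[a, b], [c, d]] @ [x_p, x_q] = rhs.
    x_p = (rhs[0] * d - b * rhs[1]) / det
    x_q = (a * rhs[1] - c * rhs[0]) / det
    return x_p, x_q

def cond_2x2(a, b, c, d):
    """2-norm condition number of [[a, b], [c, d]], computed as
    sigma_max^2 / |det| from the Frobenius norm and the determinant."""
    det = abs(a * d - b * c)
    f2 = a * a + b * b + c * c + d * d
    disc = max(f2 * f2 - 4.0 * det * det, 0.0)
    return (f2 + math.sqrt(disc)) / (2.0 * det)

def worst_condition(code):
    """Largest condition number over all pairs of columns that could be erased."""
    n = len(code[0])
    return max(
        cond_2x2(code[0][p], code[0][q], code[1][p], code[1][q])
        for p in range(n)
        for q in range(p + 1, n)
    )

if __name__ == "__main__":
    for n in (8, 64, 256):
        print(n, worst_condition(vandermonde_code(n)), worst_condition(circle_code(n)))
```

In this sketch, the worst-case condition number of the recovery system grows with the number of protected values for both generator matrices, but at different rates. That gap is the kind of effect the choice of code is meant to control.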