IEEE Transactions on Computers
Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. Algorithm-based schemes have been proposed for a wide variety of numerical applications. However, for a particular class of numerical applications, namely those involving the iterative solution of linear systems arising from discretization of various PDEs, there exist almost no fault-tolerant algorithms in the literature. In this paper, we first describe an error-detecting version of a parallel algorithm for iteratively solving the Laplace equation over a rectangular grid. This error-detecting algorithm is based on the popular successive overrelaxation scheme with red-black ordering. We use the Laplace equation merely as a vehicle for discussion; later in the paper we show how to modify the algorithm to devise error-detecting iterative schemes for solving linear systems arising from discretizations of other PDEs, such as the Poisson equation and a variant of the Laplace equation with a mixed derivative term. We also discuss a modification of the basic scheme to handle situations where the underlying solution domain is not rectangular. We then discuss a somewhat different error-detecting algorithm for iterative solution of PDEs which can be expected to yield better error coverage. We also present a new way of dealing with the roundoff errors which complicate the check phase of algorithm-based schemes. Our approach is based on error analysis incorporating some simplifications and gives high fault coverage and no false alarms for a large variety of data sets. We report experimental results on the error coverage and performance overhead of our algorithm-based error-detection schemes on an Intel iPSC/2 hypercube multiprocessor. The timing overheads of our error-detecting algorithms over the basic iterative algorithms involving no error detection decrease with increasing problem dimension and become small for large data sizes.