Algorithm-Based Error-Detection Schemes for Iterative Solution of Partial Differential Equations

Authors:
Amber Roy-Chowdhury;Nikolas Bellas;Prithviraj Banerjee
Venue:
IEEE Transactions on Computers
Year:
1996

Abstract

Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. Algorithm-based schemes have been proposed for a wide variety of numerical applications. However, for a particular class of numerical applications, namely those involving the iterative solution of linear systems arising from discretization of various PDEs, there exist almost no fault-tolerant algorithms in the literature. In this paper, we first describe an error-detecting version of a parallel algorithm for iteratively solving the Laplace equation over a rectangular grid. This error-detecting algorithm is based on the popular successive overrelaxation scheme with red-black ordering. We use the Laplace equation merely as a vehicle for discussion; later in the paper we show how to modify the algorithm to devise error-detecting iterative schemes for solving linear systems arising from discretizations of other PDEs, such as the Poisson equation and a variant of the Laplace equation with a mixed derivative term. We also discuss a modification of the basic scheme to handle situations where the underlying solution domain is not rectangular. We then discuss a somewhat different error-detecting algorithm for iterative solution of PDEs which can be expected to yield better error coverage. We also present a new way of dealing with the roundoff errors which complicate the check phase of algorithm-based schemes. Our approach is based on error analysis incorporating some simplifications and gives high fault coverage and no false alarms for a large variety of data sets. We report experimental results on the error coverage and performance overhead of our algorithm-based error-detection schemes on an Intel iPSC/2 hypercube multiprocessor. The timing overheads of our error-detecting algorithms over the basic iterative algorithms involving no error detection decrease with increasing problem dimension and become small for large data sizes.
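
To make the idea concrete, the following is a minimal serial sketch of red-black successive overrelaxation for the Laplace equation with a checksum-style test after each half-sweep. It is an illustration only and should not be read as the scheme developed in the paper: the abstract does not specify the check, so the invariant used here (the sum of the updated colour is predicted from the old values through a weighted sum of neighbours) is an assumption, as are the function names, the grid size, the boundary values, the injected fault, and the fixed relative tolerance `rel_tol`, which stands in for the error-analysis-based tolerance the abstract mentions.

```python
import numpy as np

def color_mask(shape, parity):
    """Interior grid points whose (i + j) % 2 equals parity (0 = red, 1 = black)."""
    i, j = np.indices(shape)
    mask = (i + j) % 2 == parity
    mask[0, :] = mask[-1, :] = False
    mask[:, 0] = mask[:, -1] = False
    return mask

def neighbor_count(mask):
    """For every grid point, the number of its four neighbours that lie in mask."""
    m = mask.astype(float)
    c = np.zeros_like(m)
    c[:-1, :] += m[1:, :]
    c[1:, :] += m[:-1, :]
    c[:, :-1] += m[:, 1:]
    c[:, 1:] += m[:, :-1]
    return c

def checked_half_sweep(u, mask, omega, fault=0.0, rel_tol=1e-10):
    """Update one colour in place; return True if the checksum test passes."""
    # Predicted sum of the new values, computed from the old grid only.
    weights = neighbor_count(mask)
    predicted = (1.0 - omega) * u[mask].sum() + 0.25 * omega * (weights * u).sum()

    # The actual SOR update of the chosen colour.
    nbrs = np.zeros_like(u)
    nbrs[1:-1, 1:-1] = u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
    u[mask] = (1.0 - omega) * u[mask] + 0.25 * omega * nbrs[mask]

    if fault:
        # Hypothetical fault: corrupt one freshly updated value.
        idx = tuple(np.argwhere(mask)[0])
        u[idx] += fault

    computed = u[mask].sum()
    # Placeholder tolerance (an assumption), scaled by the data magnitude.
    return abs(computed - predicted) <= rel_tol * (np.abs(u).sum() + 1.0)

def solve(n=32, omega=1.5, sweeps=50, fault_at=None):
    """Run checked sweeps; return (grid, index of the sweep that failed or None)."""
    u = np.zeros((n, n))
    u[0, :] = 1.0  # assumed Dirichlet boundary: one edge held at 1, the rest at 0
    red, black = color_mask(u.shape, 0), color_mask(u.shape, 1)
    for k in range(sweeps):
        fault = 1e-3 if k == fault_at else 0.0
        ok_red = checked_half_sweep(u, red, omega, fault)
        ok_black = checked_half_sweep(u, black, omega)
        if not (ok_red and ok_black):
            return u, k
    return u, None
```

Calling `solve()` runs fault-free, while `solve(fault_at=7)` corrupts one red value in sweep 7 so the check can be seen to fire. The fixed tolerance is the weakest part of this sketch: a value that is too small raises false alarms from roundoff, and one that is too large lets small errors through, which is the trade-off the abstract's error-analysis-based approach is meant to address.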