This paper addresses fault tolerance in hypercube architectures. Most recently proposed fault-tolerance schemes for parallel architectures deal mainly with reconfiguring the architecture once a faulty processor has been identified, and they assume that an off-line diagnosis strategy locates the faulty processor. We instead propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based fault detection. The basic idea is to obtain low-cost fault detection and location by applying high-level encodings to the data, tailored to the algorithms being executed on the parallel machine. We have implemented system-level fault detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. In this paper we discuss the results for one such application, Gaussian elimination, a widely used algorithm for solving systems of linear equations. We have also performed extensive studies of the fault coverage of our system-level fault detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings.
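To make the idea of an algorithm-tailored encoding concrete, the following is a minimal single-process sketch, not the implementation described in the paper. It assumes a row-checksum encoding: each row of the augmented matrix carries one extra entry equal to the sum of its other entries. The row operations of Gaussian elimination preserve this property, so a mismatch after an elimination step signals a fault and points to the affected row. The names `encode`, `failed_rows`, `eliminate`, the `fault` parameter, the tolerance `rel_tol`, and the example system `AUG` are all hypothetical and chosen for illustration; the distribution of rows across hypercube nodes is not modeled.

```python
# Illustrative sketch only: checksum-encoded Gaussian elimination in one process.
# All names and the tolerance value are assumptions, not taken from the paper.

def encode(aug):
    """Append a checksum column: each row gains the sum of its own entries."""
    return [list(row) + [sum(row)] for row in aug]


def failed_rows(m, rel_tol=1e-9):
    """Return indices of rows whose checksum disagrees beyond a tolerance.

    Floating-point round-off means the check cannot demand exact equality,
    so the threshold is scaled by the largest magnitude in the row.
    """
    bad = []
    for i, row in enumerate(m):
        data, check = row[:-1], row[-1]
        scale = max(1.0, max(abs(x) for x in row))
        if abs(sum(data) - check) > rel_tol * scale:
            bad.append(i)
    return bad


def eliminate(m, fault=None):
    """Forward elimination with partial pivoting on a checksum-encoded matrix.

    fault = (step, row, col, delta) is a hypothetical fault injection:
    after elimination step `step`, `delta` is added to m[row][col].
    Returns a list of (step, bad_row_indices) for every step that fails
    the checksum test. The matrix m is modified in place.
    """
    n = len(m)
    detections = []
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k up.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            # The same row operation is applied to the checksum entry,
            # which is why the checksum property survives a fault-free step.
            m[i] = [a - f * b for a, b in zip(m[i], m[k])]
        if fault is not None and fault[0] == k:
            _, r, c, delta = fault
            m[r][c] += delta
        bad = failed_rows(m)
        if bad:
            detections.append((k, bad))
    return detections


# Hypothetical 3x3 system in augmented form [A | b].
AUG = [
    [2.0, 1.0, -1.0, 8.0],
    [-3.0, -1.0, 2.0, -11.0],
    [-2.0, 1.0, 2.0, -3.0],
]

clean = eliminate(encode(AUG))
faulty = eliminate(encode(AUG), fault=(0, 2, 1, 0.5))
```

In this sketch a fault-free run reports no failed rows, while the injected error of 0.5 is flagged in the step where it occurs. An injected error far smaller than the tolerance (for example `delta=1e-12`) passes unnoticed, which illustrates the trade-off between fault coverage and finite-precision round-off: a tolerance loose enough to absorb round-off also masks sufficiently small errors.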