A novel approach to system-level fault tolerance in hypercube multiprocessors

Authors:
P. Banerjee;C. B. Stunkel
Affiliations:
Computer Systems Group, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign;Computer Systems Group, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
Venue:
C3P Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1
Year:
1988

Abstract

This paper addresses fault tolerance in hypercube architectures. Most recently proposed fault-tolerance schemes for parallel architectures deal mainly with reconfiguring the architecture once a faulty processor has been identified, and they assume the existence of an off-line diagnosis strategy that locates the faulty processor. We instead propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based fault detection. The basic idea is to obtain low-cost fault detection and location by applying high-level encodings to the data, tailored to the algorithms being executed on the parallel machine. We have implemented system-level fault detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. In this paper we discuss the results for one such application, Gaussian elimination, an extremely useful algorithm for solving a set of linear equations. We have also performed extensive studies of the fault coverage of our system-level fault detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings.
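
As a rough illustration of what a data encoding tailored to Gaussian elimination can look like, the sketch below appends a checksum to every row of the augmented matrix. Row operations are linear, so the checksum entry should keep matching the sum of the other entries; a row that fails the test points at whichever processor owns it. The abstract does not specify the encoding, so this is a generic row-checksum scheme and not necessarily the authors' method. All names (`add_checksums`, `row_is_consistent`, `eliminate`), the `fault` injection tuple, and the tolerance formula are assumptions made for this example, and the matrix is assumed to be nonsingular.

```python
def add_checksums(rows):
    """Append to each row the sum of its entries (the checksum encoding)."""
    return [list(row) + [sum(row)] for row in rows]


def row_is_consistent(row, rel_tol=1e-9):
    """Check that the last entry still equals the sum of the others.

    Floating-point roundoff makes an exact comparison unreliable, so the
    test uses a tolerance scaled by the row's magnitude and length.
    The scaling rule is an illustrative assumption, not taken from the paper.
    """
    data, checksum = row[:-1], row[-1]
    scale = max(1.0, max(abs(x) for x in row))
    return abs(sum(data) - checksum) <= rel_tol * scale * len(row)


def eliminate(rows, fault=None, rel_tol=1e-9):
    """Forward elimination with partial pivoting on checksum-encoded rows.

    `fault` is a hypothetical (step, row, col, delta) tuple that corrupts one
    entry after the given elimination step, imitating a faulty processor.
    Returns (rows, suspects), where suspects lists the (step, row) pairs
    whose checksum test failed.
    """
    n = len(rows)
    rows = [[float(x) for x in r] for r in rows]  # work on a copy
    suspects = []
    for k in range(n):
        # Partial pivoting: bring the largest entry in column k to row k.
        pivot = max(range(k, n), key=lambda i: abs(rows[i][k]))
        rows[k], rows[pivot] = rows[pivot], rows[k]
        # The checksum column is updated by the same row operation as the data.
        for i in range(k + 1, n):
            factor = rows[i][k] / rows[k][k]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[k])]
        # Optional fault injection for experimentation.
        if fault is not None and fault[0] == k:
            _, r, c, delta = fault
            rows[r][c] += delta
        # Concurrent check: every row must still satisfy its checksum.
        for i in range(n):
            if not row_is_consistent(rows[i], rel_tol):
                suspects.append((k, i))
    return rows, suspects
```

Calling `eliminate(add_checksums(augmented_rows))` on a fault-free run should return an empty suspect list, while passing a `fault` tuple should flag the corrupted row at the step where the error was injected. The tolerance matters in both directions: if it is too tight, roundoff causes false alarms, and if it is too loose, small errors escape detection. This is one way finite-precision arithmetic can affect the fault coverage of such encodings.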