Low Overhead Multiprocessor Allocation Strategies Exploiting System Spare Capacity for Fault Detection and Location

Authors:
Arun K. Somani;Srinivasan Tridandapani;Upender R. Sandadi
Venue:
IEEE Transactions on Computers
Year:
1995

Citing 6
Cited 13

Roving Emulation as a Fault Detection Mechanism

IEEE Transactions on Computers
The Comparison Approach to Multiprocessor Fault Diagnosis

IEEE Transactions on Computers
Spare Capacity as a Means of Fault Detection and Diagnosis in Multiprocessor Systems

IEEE Transactions on Computers
Sequential Fault Occurrence and Reconfiguration in System Level Diagnosis

IEEE Transactions on Computers
SPNP: Stochastic Petri Net Package

PNPM '89 The Proceedings of the Third International Workshop on Petri Nets and Performance Models
Theory, Volume 1, Queueing Systems

Theory, Volume 1, Queueing Systems

Free performance and fault tolerance (extended abstract): using system idle capacity efficiently

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Loop Transformations for Fault Detection in Regular Loops on Massively Parallel Systems

IEEE Transactions on Parallel and Distributed Systems
A Fault-Tolerant Dynamic Scheduling Algorithm for Multiprocessor Real-Time Systems and Its Analysis

IEEE Transactions on Parallel and Distributed Systems
An Adaptive Scheme for Fault-Tolerant Scheduling of Soft Real-Time Tasks in Multiprocessor Systems

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Hardware-Software Co-Reliability in Field Reconfigurable Multi-Processor-Memory Systems

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A New Fault-Tolerant Technique for Improving the Schedulability in Multiprocessor Real-time Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
REESE: A Method of Soft Error Detection in Microprocessors

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Scheduling Algorithms Exploiting Spare Capacity and Tasks' Laxities for Fault Detection and Location in Real-time Multiprocessor Systems

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Efficient overloading techniques for primary-backup scheduling in real-time systems

Journal of Parallel and Distributed Computing
Time-Constrained Failure Diagnosis in Distributed Embedded Systems: Application to Actuator Diagnosis

IEEE Transactions on Parallel and Distributed Systems
An adaptive scheme for fault-tolerant scheduling of soft real-time tasks in multiprocessor systems

Journal of Parallel and Distributed Computing
Towards Nanoelectronics Processor Architectures

Journal of Electronic Testing: Theory and Applications
Improving chip multiprocessor reliability through code replication

Computers and Electrical Engineering

Quantified Score

Hi-index	14.99

Abstract

Several schemes for detecting faults at the processor level in a multiprocessor system have been discussed in the past. One such scheme [1] runs secondary versions of jobs on the unused, or spare, processors of the system and uses the comparison approach [2] to detect faults. We build upon this scheme and propose three new multiprocessor allocation strategies that run a variable number of versions per job. These schemes permit on-line detection and, in many cases, location of faulty processors in a system with nominal degradation in its delay/throughput performance; this degradation is limited chiefly to the delays associated with job preemptions. Two new metrics, the fault detection capability (FDC) and the fault location capability (FLC), are introduced to evaluate these schemes. Extensive simulations are performed to obtain performance figures for the various schemes. Stochastic Petri Net models are also developed to obtain approximate performance results. The results show that these schemes utilize spare capacity more efficiently, thereby improving the fault detection and location capabilities of the system.
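
The comparison idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the allocation strategy from the paper: the function name `diagnose`, the processor labels, and the majority-vote rule used for location are all assumptions made for illustration. The sketch assumes that each version of a job returns a hashable result, that fault-free processors agree, and that a strict majority of the versions ran on fault-free processors.

```python
from collections import Counter


def diagnose(results):
    """Compare the outputs of the versions of one job.

    results: dict mapping a (hypothetical) processor id to the output
             produced by the version of the job that ran on it.
    Returns (fault_detected, suspects):
      - (False, set())  when all versions agree, or only one version ran
      - (True, None)    when a mismatch is seen but no strict majority
                        exists, so the fault cannot be located
      - (True, {ids})   when a strict majority agrees; the dissenting
                        processors are reported as suspects
    """
    counts = Counter(results.values())
    if len(counts) <= 1:
        # Nothing to compare against, or every version agrees.
        return False, set()

    # A disagreement means at least one processor is faulty (detection).
    top_value, top_count = counts.most_common(1)[0]
    if top_count > len(results) / 2:
        # Assumed rule: a strict majority is trusted (location).
        suspects = {p for p, v in results.items() if v != top_value}
        return True, suspects
    return True, None


if __name__ == "__main__":
    print(diagnose({"P0": 42, "P1": 42}))            # two versions agree
    print(diagnose({"P0": 42, "P1": 41}))            # mismatch, not locatable
    print(diagnose({"P0": 42, "P1": 42, "P2": 7}))   # mismatch, P2 suspected
```

Under these assumptions, two versions of a job are enough to detect a fault, while three or more are needed before a dissenting processor can be singled out. This is one way to read why running a variable number of versions per job bears on both detection and location.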