Roving Emulation as a Fault Detection Mechanism
IEEE Transactions on Computers
The Comparison Approach to Multiprocessor Fault Diagnosis
IEEE Transactions on Computers
Spare Capacity as a Means of Fault Detection and Diagnosis in Multiprocessor Systems
IEEE Transactions on Computers
Sequential Fault Occurrence and Reconfiguration in System Level Diagnosis
IEEE Transactions on Computers
SPNP: Stochastic Petri Net Package
PNPM '89 The Proceedings of the Third International Workshop on Petri Nets and Performance Models
Theory, Volume 1, Queueing Systems
Theory, Volume 1, Queueing Systems
Free performance and fault tolerance (extended abstract): using system idle capacity efficiently
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Loop Transformations for Fault Detection in Regular Loops on Massively Parallel Systems
IEEE Transactions on Parallel and Distributed Systems
A Fault-Tolerant Dynamic Scheduling Algorithm for Multiprocessor Real-Time Systems and Its Analysis
IEEE Transactions on Parallel and Distributed Systems
An Adaptive Scheme for Fault-Tolerant Scheduling of Soft Real-Time Tasks in Multiprocessor Systems
HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Hardware-Software Co-Reliability in Field Reconfigurable Multi-Processor-Memory Systems
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A New Fault-Tolerant Technique for Improving the Schedulability in Multiprocessor Real-time Systems
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
REESE: A Method of Soft Error Detection in Microprocessors
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Efficient overloading techniques for primary-backup scheduling in real-time systems
Journal of Parallel and Distributed Computing
IEEE Transactions on Parallel and Distributed Systems
An adaptive scheme for fault-tolerant scheduling of soft real-time tasks in multiprocessor systems
Journal of Parallel and Distributed Computing
Towards Nanoelectronics Processor Architectures
Journal of Electronic Testing: Theory and Applications
Improving chip multiprocessor reliability through code replication
Computers and Electrical Engineering
Several schemes for detecting processor-level faults in multiprocessor systems have been proposed in the past. One such scheme [1] runs secondary versions of jobs on the unused, or spare, processors of the system and uses the comparison approach [2] to detect faults. We build upon this scheme and propose three new multiprocessor allocation strategies that run a variable number of versions per job. These strategies permit on-line detection and, in many cases, location of faulty processors with only nominal degradation in the system's delay/throughput performance; the added delays are limited chiefly to those associated with job preemptions. Two new metrics, the fault detection capability (FDC) and the fault location capability (FLC), are introduced to evaluate these schemes. Extensive simulations are performed to obtain performance figures for the various schemes, and Stochastic Petri Net models are developed to obtain approximate performance results. The results show that these schemes utilize spare capacity more efficiently, thereby improving the fault detection and location capabilities of the system.
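To illustrate why running a variable number of versions per job matters, the following is a minimal sketch of comparison-based checking. It is not the paper's allocation strategy: the function name `compare_versions`, the processor labels, and the majority rule used to locate a faulty processor are all assumptions made for illustration. The sketch also assumes that a strict majority of the versions is correct. With two versions, a mismatch only reveals that some processor is faulty; with three or more, the disagreeing processor can often be singled out.

```python
from collections import Counter


def compare_versions(results):
    """Compare the outputs of several versions of one job.

    `results` maps a (hypothetical) processor id to the output that
    processor produced for its version of the job.

    Returns (fault_detected, suspects):
      - (False, set())  if all versions agree,
      - (True, {ids})   if a strict majority agrees, so the dissenting
                        processors are suspected (location),
      - (True, None)    if versions disagree without a strict majority
                        (detection only, no location).
    """
    counts = Counter(results.values())
    if len(counts) == 1:
        return False, set()

    majority_value, votes = counts.most_common(1)[0]
    if votes > len(results) / 2:
        # Assumption: the majority output is the correct one.
        suspects = {p for p, out in results.items() if out != majority_value}
        return True, suspects
    return True, None


# Two versions: the mismatch is detected but cannot be located.
print(compare_versions({"P0": 42, "P1": 41}))
# Three versions: the dissenting processor is singled out.
print(compare_versions({"P0": 42, "P1": 42, "P2": 7}))
```

Under these assumptions, each extra version run on a spare processor can turn a detection-only outcome into one that also locates the fault, which is the trade-off the allocation strategies explore.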