International Journal of Critical Computer-Based Systems
In this paper, the authors present a new approach to algorithm-based fault tolerance (ABFT) for high-performance computing systems. The ABFT approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, into a system that provides a specific level of fault tolerance, namely recovery. ABFT techniques that detect errors rely on comparing parity values computed in two ways: the input parity values are processed in parallel to produce output parity values, which are then compared with parity values regenerated from the original processed outputs. Convolutional codes can supply the redundancy. The method is a new approach to concurrent error correction in fault-tolerant computing systems and offers a novel computing paradigm for providing fault tolerance in numerical algorithms. The authors also present, implement, and evaluate early detection in ABFT.
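The two-way parity comparison described in the abstract can be illustrated with a minimal sketch. It assumes the classic checksum form of ABFT for a matrix-vector product, not the convolutional-code scheme the paper proposes, and the function names (`matvec`, `column_checksum`, `abft_check`) are hypothetical, chosen only for this example. One parity value is predicted from the inputs, the other is regenerated from the computed outputs, and a mismatch signals an error.

```python
def matvec(a, x):
    """Main computation: y = A x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def column_checksum(a):
    """Parity row of A: the sum of each column (an assumed, simple checksum code)."""
    return [sum(col) for col in zip(*a)]


def abft_check(a, x, y, tol=1e-9):
    """Compare parity computed two ways; True means no error was detected."""
    # Path 1: process the input parity alongside the main computation.
    predicted = sum(c * xj for c, xj in zip(column_checksum(a), x))
    # Path 2: regenerate the parity from the outputs actually produced.
    regenerated = sum(y)
    return abs(predicted - regenerated) <= tol


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x = [1, 0, -1]

y = matvec(A, x)
ok_clean = abft_check(A, x, y)

# Simulate a transient fault that corrupts one output element.
y_faulty = list(y)
y_faulty[1] += 5
ok_faulty = abft_check(A, x, y_faulty)

print(ok_clean, ok_faulty)
```

A single scalar checksum like this only detects an error. Locating or correcting it needs more redundancy, which is the role the abstract assigns to convolutional codes.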