Analysis and Evaluation of a New Algorithm Based Fault Tolerance for Computing Systems

  • Authors:
  • Hodjat Hamidi;Abbas Vafaei;Seyed Amir Hassan Monadjemi

  • Affiliations:
  • University of Isfahan, Iran;University of Isfahan, Iran;University of Isfahan, Iran

  • Venue:
  • International Journal of Grid and High Performance Computing
  • Year:
  • 2012

Abstract

In this paper, the authors present a new approach to algorithm-based fault tolerance (ABFT) for high-performance computing systems. The ABFT approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, into a system that provides a specific level of fault tolerance, namely recovery. ABFT techniques detect errors by comparing parity values computed in two ways: parity values derived from the inputs are processed in parallel to produce output parity values, and these are compared with parity values regenerated from the original processed outputs. Convolution codes can be applied to provide the redundancy. The method is a new approach to concurrent error correction in fault-tolerant computing systems, and it offers a novel computing paradigm for providing fault tolerance in numerical algorithms. The authors also present, implement, and evaluate early detection in ABFT.
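The two-way parity comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: it swaps the convolution codes for a simple column-checksum parity on a matrix product, and the function names (`matmul`, `column_sums`, `abft_check`) are hypothetical, chosen only for this example. It assumes integer inputs so that the two parity values can be compared exactly; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_sums(m):
    """Simple parity stand-in (an assumption): the sum of each column."""
    return [sum(col) for col in zip(*m)]


def abft_check(a, b, c):
    """Return the output columns whose two parity values disagree."""
    # Way 1: process the input parity through the same computation.
    parity_from_inputs = matmul([column_sums(a)], b)[0]
    # Way 2: regenerate the parity from the already computed outputs.
    parity_from_outputs = column_sums(c)
    return [j for j, (p, q)
            in enumerate(zip(parity_from_inputs, parity_from_outputs))
            if p != q]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)

clean = abft_check(a, b, c)    # no fault: both parity paths agree

c[1][0] += 1                   # inject a single fault into the output
faulty = abft_check(a, b, c)   # the mismatch flags the affected column
```

In this sketch a mismatch only locates the faulty column. Recovery, as the abstract describes it, would need additional redundancy (for example a second checksum over rows) to locate and correct the erroneous element.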