Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods

  • Authors:
  • Zizhong Chen

  • Affiliations:
  • University of California, Riverside, USA

  • Venue:
  • Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline through the comparison of the final computation results of two duplicated computations, but this approach often introduces significant overhead. This paper presents Online-ABFT, a simple but efficient online soft error detection technique that can detect soft errors in the widely used Krylov subspace iterative methods in the middle of the program execution so that the computation efficiency can be improved through the termination of the corrupted computation in a timely manner soon after a soft error occurs. Based on a simple verification of orthogonality and residual, Online-ABFT is easy to implement and highly efficient. Experimental results demonstrate that, when this online error detection approach is used together with checkpointing, it improves the time to obtain correct results by up to several orders of magnitude over the traditional offline approach.