Fault tolerant matrix-matrix multiplication: correcting soft errors on-line

  • Authors:
  • Panruo Wu;Chong Ding;Longxiang Chen;Feng Gao;Teresa Davies;Christer Karlsson;Zizhong Chen

  • Affiliations:
  • Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA

  • Venue:
  • Proceedings of the second workshop on Scalable algorithms for large-scale systems
  • Year:
  • 2011

Abstract

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. They normally do not interrupt the execution of the affected program, but the results of the affected computation can no longer be trusted. A well-known technique for correcting soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). Although ABFT is much more efficient than triple modular redundancy (TMR), a traditional general-purpose technique for correcting soft errors, both ABFT and TMR detect errors off-line, after the computation has finished. This paper extends the traditional ABFT technique from off-line to on-line, so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation, during program execution, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with a negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
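
The idea of checking checksums during the computation rather than after it can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python example that assumes the classic checksum encoding usually associated with ABFT (a column-checksum row appended to A, a row-checksum column appended to B) and forms the product as a sequence of rank-1 updates, so that the checksum relations hold after every update and can be verified part-way through. The function names, the `check_every` parameter, and the `inject` fault-injection hook are all hypothetical, and the sketch assumes at most one corrupted data element (not a checksum element) between two checks.

```python
def encode_column_checksum(A):
    """Return A with one extra row holding the sum of each column."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]


def encode_row_checksum(B):
    """Return B with one extra column holding the sum of each row."""
    return [row[:] + [sum(row)] for row in B]


def find_and_fix_single_error(C, tol=1e-9):
    """Verify the checksums of the full-checksum matrix C.

    If exactly one data row and one data column disagree with their
    checksums, correct the element at their intersection in place and
    return its (row, column) index. Return None if everything agrees.
    """
    m, n = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(C[i][:n]) - C[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(C[i][j] for i in range(m)) - C[m][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data element")
    i, j = bad_rows[0], bad_cols[0]
    # The row discrepancy equals the amount by which C[i][j] is off.
    C[i][j] -= sum(C[i][:n]) - C[i][n]
    return (i, j)


def abft_matmul_online(A, B, check_every=1, inject=None):
    """Multiply A by B with checksum verification during the computation.

    inject is a hypothetical test hook (k, i, j, delta): after rank-1
    update k, delta is added to C[i][j] to simulate a soft error.
    Returns the product and a list of (step, (row, column)) corrections.
    """
    Ac = encode_column_checksum(A)
    Br = encode_row_checksum(B)
    rows, cols, inner = len(Ac), len(Br[0]), len(B)
    C = [[0.0] * cols for _ in range(rows)]
    corrections = []
    for k in range(inner):
        # Rank-1 update: column k of Ac times row k of Br.
        for i in range(rows):
            for j in range(cols):
                C[i][j] += Ac[i][k] * Br[k][j]
        if inject is not None and inject[0] == k:
            _, i, j, delta = inject
            C[i][j] += delta
        # The partial result is itself a full-checksum matrix, so it
        # can be verified before the multiplication has finished.
        if (k + 1) % check_every == 0 or k == inner - 1:
            hit = find_and_fix_single_error(C)
            if hit is not None:
                corrections.append((k, hit))
    # Strip the checksum row and column before returning.
    return [row[:-1] for row in C[:-1]], corrections
```

For example, calling `abft_matmul_online([[1, 2], [3, 4]], [[5, 6], [7, 8]], inject=(0, 1, 0, 100.0))` corrupts one element after the first update, and the check at that step locates and repairs it before the second update is applied. Larger values of `check_every` reduce the verification overhead, at the cost of a longer window in which a second error could make the pattern uncorrectable by this simple scheme.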