Trade-offs in transient fault recovery schemes for redundant multithreaded processors

  • Authors:
  • Joseph Sharkey;Nayef Abu-Ghazeleh;Dmitry Ponomarev;Kanad Ghose;Aneesh Aggarwal

  • Affiliations:
  • Department of Computer Science;Department of Computer Science;Department of Computer Science;Department of Computer Science;Department of Electrical Engineering, State University of New York at Binghamton

  • Venue:
  • HiPC'06 Proceedings of the 13th international conference on High Performance Computing
  • Year:
  • 2006

Abstract

CMOS downscaling trends, manifested in smaller transistor feature sizes and lower supply voltages, make microprocessors increasingly vulnerable to transient errors with each new technology generation. One architectural approach to detecting and recovering from such errors is to execute two copies of the same program and compare the results. While comparing only the store instructions is sufficient for error detection, register values also have to be compared to support fault recovery. In this paper, we propose novel checkpoint-assisted mechanisms for efficient fault recovery that dramatically reduce the number of register values that must be compared to detect soft errors. We also present a comprehensive investigation of these and other existing recovery schemes from the standpoint of performance, power, and design complexity.
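
The detection idea described in the abstract (run two copies, compare their stores) can be illustrated with a small software sketch. This is only a toy model under stated assumptions, not the paper's hardware mechanism: the names `run_copy` and `first_mismatch`, the running-sum workload, and the choice of which bit to flip are all hypothetical and chosen purely for illustration.

```python
from typing import List, Optional, Tuple

Store = Tuple[int, int]  # (address, value) written by a store instruction


def run_copy(data: List[int], flip_at: Optional[int] = None) -> List[Store]:
    """Toy stand-in for one redundant thread (hypothetical workload).

    Computes a running sum and emits one store per input element.
    If flip_at is given, one bit of the accumulator is flipped at that
    step to model a transient fault in a register.
    """
    acc = 0
    stores: List[Store] = []
    for i, x in enumerate(data):
        acc += x
        if i == flip_at:
            acc ^= 1 << 3  # assumed soft error: flip bit 3 of the register
        stores.append((i, acc))
    return stores


def first_mismatch(leading: List[Store], trailing: List[Store]) -> Optional[int]:
    """Return the index of the first store that differs, or None if all match."""
    for idx, (a, b) in enumerate(zip(leading, trailing)):
        if a != b:
            return idx
    return None


data = [1, 2, 3, 4]
clean = run_copy(data)               # fault-free copy
faulty = run_copy(data, flip_at=2)   # copy hit by a transient fault
print(first_mismatch(clean, clean))
print(first_mismatch(clean, faulty))
```

In this sketch a mismatch only signals that the two copies have diverged by the time a store is reached. As the abstract notes, supporting recovery additionally requires comparing register values, which is the cost the proposed checkpoint-assisted mechanisms aim to reduce.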