Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing

  • Authors:
  • Hao Chen;Chengmo Yang

  • Affiliations:
  • University of Delaware, Newark, DE, USA;University of Delaware, Newark, DE, USA

  • Venue:
  • Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
  • Year:
  • 2013

Abstract

While unending technology scaling has brought reliability to the forefront of the semiconductor industry's concerns, fault tolerance techniques are still rarely incorporated into existing designs due to their high overhead. One fault tolerance scheme that has received considerable research attention is duplication and checkpointing. However, most techniques in this category employ a blind strategy to compare instruction results, which not only generates large overhead in buffering and verifying these values, but also induces unnecessary rollbacks to recover from faults that would never influence subsequent execution. To tackle these issues, we introduce in this paper an approach that identifies the minimum set of instruction results for fault detection and checkpointing. For a given application, the proposed technique first identifies the control and data flow information of each execution hotspot, and then selects as the comparison set only the instruction results that either influence the final program results or are needed during re-execution. Our experimental studies demonstrate that the proposed hotspot-targeting technique is able to reduce the comparison overhead by nearly 88% and mask over 38% of all injected faults, while at the same time delivering full fault coverage.
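
The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, only a simplified reading of it under stated assumptions: the hotspot is a single straight-line block, each instruction writes one register, and the set of registers that are live after the hotspot is already known. The names `Instr`, `comparison_set`, and `checkpoint_set` are hypothetical and introduced only for this illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """One instruction in a straight-line hotspot (hypothetical model)."""
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def comparison_set(hotspot, live_out):
    """Indices of instructions holding the final definition of a live-out register.

    Assumption: only these results can influence execution after the hotspot,
    so they are the only ones the duplicated executions need to compare.
    """
    selected = []
    pending = set(live_out)
    # Walk backwards so the first match for a register is its last definition.
    for idx in range(len(hotspot) - 1, -1, -1):
        dest = hotspot[idx].dest
        if dest in pending:
            selected.append(idx)
            pending.discard(dest)
    return sorted(selected)


def checkpoint_set(hotspot):
    """Registers read before being written and later overwritten in the hotspot.

    Assumption: re-executing the hotspot needs the original values of these
    registers, so they must be saved before the hotspot starts.
    """
    written = set()
    live_in = set()
    for ins in hotspot:
        for src in ins.srcs:
            if src not in written:
                live_in.add(src)
        written.add(ins.dest)
    return sorted(live_in & written)


# Illustrative hotspot: instruction 2 produces a value that is never used,
# and instruction 3 overwrites the live-in register "a".
hotspot = [
    Instr("t1", ("a", "b")),   # 0
    Instr("t2", ("t1", "c")),  # 1
    Instr("t3", ("a", "a")),   # 2: dead result, a fault here is masked
    Instr("a", ("t2", "b")),   # 3
    Instr("r", ("a", "t2")),   # 4
]
live_out = {"r", "a"}

print(comparison_set(hotspot, live_out))  # [3, 4]
print(checkpoint_set(hotspot))            # ['a']
```

In this toy example only two of the five instruction results are compared, and a fault in the dead result `t3` triggers no rollback. That is the kind of saving the abstract attributes to a non-blind comparison strategy. A real implementation would also have to handle control flow, memory operations, and multiple hotspots.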