Relax: an architectural framework for software recovery of hardware faults

  • Authors:
  • Marc de Kruijf;Shuou Nomura;Karthikeyan Sankaralingam

  • Affiliations:
  • University of Wisconsin -- Madison, Madison, WI, USA;University of Wisconsin -- Madison, Madison, WI, USA;University of Wisconsin -- Madison, Madison, WI, USA

  • Venue:
  • Proceedings of the 37th annual international symposium on Computer architecture
  • Year:
  • 2010

Quantified Score

Hi-index 0.01

Visualization

Abstract

As technology scales ever further, device unreliability is creating excessive complexity for hardware to maintain the illusion of perfect operation. In this paper, we consider whether exposing hardware fault information to software and allowing software to control fault recovery simplifies hardware design and helps technology scaling. The combination of emerging applications and emerging many-core architectures makes software recovery a viable alternative to hardware-based fault recovery. Emerging applications tend to have few I/O and memory side-effects, which limits the amount of information that needs checkpointing, and they allow discarding individual sub-computations with small qualitative impact. Software recovery can harness these properties in ways that hardware recovery cannot. We describe Relax, an architectural framework for software recovery of hardware faults. Relax includes three core components: (1) an ISA extension that allows software to mark regions of code for software recovery, (2) a hardware organization that simplifies reliability considerations and provides energy efficiency with hardware recovery support removed, and (3) software support for compilers and programmers to utilize the Relax ISA. Applying Relax to counter the effects of process variation, our results show a 20% energy efficiency improvement for PARSEC applications with only minimal source code changes and simpler hardware.