Recent Advances and New Avenues in Hardware-Level Reliability Support

  • Authors:
  • Ravishankar K. Iyer;Nithin M. Nakka;Zbigniew T. Kalbarczyk;Subhasish Mitra

  • Affiliations:
  • University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;Intel

  • Venue:
  • IEEE Micro
  • Year:
  • 2005

Abstract

Transient (or soft) errors are a major concern in the design and implementation of the current generation of highly integrated digital systems. The continuous pushing of the processor performance envelope and the deployment of computer systems in complex mission- and life-critical applications have further increased the significance and impact of transient errors. In hardware, these errors have been handled at the device, circuit, and architectural levels by employing information redundancy, space redundancy, time redundancy, or a combination of them. This paper analyzes techniques developed at the circuit and architectural levels, both in experimental academic research and in industry. Past studies indicate that most low-level errors do not translate into errors in the outcome of the application, which is the primary concern of the user. Therefore, an alternative paradigm called application-aware runtime checking is proposed. In this approach, the application is analyzed either statically or through dynamic profiling to extract its reliability-sensitive characteristics. Based on the extracted application properties, hardware checkers/modules are devised and embedded in a processor-level framework to enable runtime error detection and recovery. The architecture of the Illinois Reliability and Security Engine is presented as a possible implementation of such a framework.
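The three forms of redundancy named in the abstract can be illustrated with a small software sketch. This is only an illustrative analogy, not the hardware mechanisms the paper surveys: the function names (parity_bit, parity_ok, tmr_vote, run_twice) are hypothetical, and the choice of even parity, majority voting over replicas, and duplicate execution are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence


def parity_bit(word: int) -> int:
    """Information redundancy: even-parity bit over the bits of a non-negative word."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Return True if the word still matches the parity stored alongside it."""
    return parity_bit(word) == stored_parity


def tmr_vote(results: Sequence[int]) -> int:
    """Space redundancy: strict-majority vote over results from replicated units."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_twice(compute: Callable[[int], int], x: int) -> int:
    """Time redundancy: execute the same computation twice and compare."""
    first = compute(x)
    second = compute(x)
    if first != second:
        raise RuntimeError("mismatch between redundant executions")
    return first


if __name__ == "__main__":
    word = 0b1011
    stored = parity_bit(word)          # parity computed when the word is written
    flipped = word ^ 0b0100            # simulate a single-bit transient error
    print(parity_ok(word, stored))     # intact word passes the check
    print(parity_ok(flipped, stored))  # single-bit flip is detected
    print(tmr_vote([7, 7, 5]))         # one faulty replica is outvoted
    print(run_twice(lambda v: v * 2, 21))
```

Parity detects any odd number of bit flips but cannot correct them, voting masks a single faulty replica at the cost of triplicated resources, and re-execution trades time for detection. Each of these checks is applied uniformly, regardless of whether the protected value affects the application outcome, which is the cost the application-aware approach aims to reduce.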