Lifetime reliability aware microprocessors

  • Authors:
  • Sarita Adve;Jayanth Srinivasan

  • Affiliations:
  • University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign

  • Venue:
  • Lifetime reliability aware microprocessors
  • Year:
  • 2006

Abstract

Ensuring long-term, or "lifetime," reliability, as dictated by the hard error rate due to wear-out-based failures, is a critical requirement for microprocessor manufacturers. At the same time, the steady increases in CMOS processor performance have been driven by aggressive device scaling. This continuous scaling, coupled with increasing on-chip temperatures, is making lifetime reliability targets increasingly difficult to meet. This dissertation addresses lifetime reliability issues from a microarchitectural perspective. Our key contributions include (i) the first architecture-level methodology for evaluating lifetime reliability as a function of application behavior, (ii) a quantification of the impact of device scaling on lifetime reliability, taking workload characteristics into consideration, and (iii) performance- and cost-effective architectural solutions targeted at enhancing lifetime reliability. The first part of this dissertation focuses on the design of tools and models to evaluate processor lifetime reliability. Using industrial-strength models for lifetime reliability modes, we develop a methodology, called RAMP, to estimate lifetime reliability from an architectural and application perspective. We propose two implementations of RAMP, RAMP 1.0 and RAMP 2.0, which differ in their utility and accuracy. This dissertation also extends the RAMP methodology by adding scaling models for different technology generations to its failure mechanisms. Our quantification of the impact of scaling on a contemporary superscalar processor shows that device scaling has a significant detrimental impact on processor hard failure rates. The second part of this dissertation examines a range of microarchitectural techniques for lifetime reliability enhancement. In contrast to previous application-oblivious methods, these techniques allow processor designers to trade off cost, performance, and reliability in an application-aware fashion.
First, we propose dynamic reliability management (DRM), in which the processor uses adaptive hardware to respond dynamically to changing application behavior and so maintain its lifetime reliability target. Our results show that DRM enables the processor to extract significant performance benefit across a spectrum of reliability design costs. Next, we study two techniques that leverage microarchitectural structural redundancy for lifetime reliability enhancement. Structural redundancy has the potential to be more cost- and performance-effective than traditional processor redundancy. In structural duplication, redundant microarchitectural structures are added to the processor and designated as spares. A spare structure can be turned on when the original structure fails, increasing the processor's lifetime. Graceful processor degradation is a technique that exploits existing microarchitectural redundancy for reliability. Redundant structures that fail are shut down while the processor remains functional, thereby increasing its lifetime, but at lower performance. Our evaluation shows significant reliability benefits from these techniques for a range of cost and performance budgets. Overall, this dissertation lays the basic foundation for microarchitectural analysis of lifetime reliability and provides new tools and techniques to handle this critical emerging technology challenge.