The Case for Lifetime Reliability-Aware Microprocessors

  • Authors:
  • Jayanth Srinivasan;Sarita V. Adve;Pradip Bose;Jude A. Rivers

  • Affiliations:
  • University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;IBM T.J. Watson Research Center, Yorktown Heights, NY;IBM T.J. Watson Research Center, Yorktown Heights, NY

  • Venue:
  • Proceedings of the 31st annual international symposium on Computer architecture
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Ensuring long processor lifetimes by limiting failuresdue to wear-out related hard errors is a critical requirementfor all microprocessor manufacturers. We observethat continuous device scaling and increasing temperaturesare making lifetime reliability targets even harder to meet.However, current methodologies for qualifying lifetime reliabilityare overly conservative since they assume worst-caseoperating conditions. This paper makes the case thatthe continued use of such methodologies will significantlyand unnecessarily constrain performance. Instead, lifetimereliability awareness at the microarchitectural design stagecan mitigate this problem, by designing processors that dynamicallyadapt in response to the observed usage to meeta reliability target.We make two specific contributions. First, we describean architecture-level model and its implementation, calledRAMP, that can dynamically track lifetime reliability, respondingto changes in application behavior. RAMP isbased on state-of-the-art device models for different wear-outmechanisms. Second, we propose dynamic reliabilitymanagement (DRM) - a technique where the processorcan respond to changing application behavior to maintainits lifetime reliability target. In contrast to currentworst-case behavior based reliability qualification methodologies,DRM allows processors to be qualified for reliabilityat lower (but more likely) operating points than theworst case. Using RAMP, we show that this can save costand/or improve performance, that dynamic voltage scalingis an effective response technique for DRM, and that dynamicthermal management neither subsumes nor is sub-sumedby DRM.