Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation

  • Authors:
  • Kypros Constantinides; Onur Mutlu; Todd Austin; Valeria Bertacco

  • Venue:
  • Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
  • Year:
  • 2007

Abstract

As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common. Such defects are bound to hinder the correct operation of future processor systems, unless new online techniques become available to detect and to tolerate them while preserving the integrity of software applications running on the system. This paper proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called Access-Control Extension (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22% of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5%. Based on a detailed RTL-level implementation of our technique, we find its area overhead to be quite modest, with only a 5.8% increase in total chip area.
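
The test-diagnose-repair loop described in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions and is not the paper's ACE instruction set or firmware: the `ToyCore` class, the `ace_set` and `ace_get` methods, the single stuck-at fault model, and the all-zeros/all-ones test patterns are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyCore:
    """Hypothetical core: one small internal state word, optionally with a stuck-at bit."""
    width: int = 8
    stuck_bit: Optional[int] = None  # index of the defective bit, None if healthy
    stuck_value: int = 0             # value the defective bit is stuck at (0 or 1)
    state: int = 0
    disabled: bool = False

    def ace_set(self, value: int) -> None:
        """Stand-in for an ACE-style 'write internal state' operation (assumed name)."""
        value &= (1 << self.width) - 1
        if self.stuck_bit is not None:
            if self.stuck_value:
                value |= 1 << self.stuck_bit     # defect forces the bit to 1
            else:
                value &= ~(1 << self.stuck_bit)  # defect forces the bit to 0
        self.state = value

    def ace_get(self) -> int:
        """Stand-in for an ACE-style 'read internal state' operation (assumed name)."""
        return self.state


def run_directed_test(core: ToyCore) -> List[int]:
    """Write all-zeros and all-ones, read back, and return the bit positions that mismatch."""
    mask = (1 << core.width) - 1
    mismatches = 0
    for pattern in (0, mask):
        core.ace_set(pattern)
        mismatches |= core.ace_get() ^ pattern
    return [bit for bit in range(core.width) if (mismatches >> bit) & 1]


def periodic_check(cores: List[ToyCore]) -> Dict[int, List[int]]:
    """One test epoch: test every core, disable defective ones, report failing bits."""
    report: Dict[int, List[int]] = {}
    for index, core in enumerate(cores):
        saved = core.ace_get()          # save state before the test overwrites it
        failing = run_directed_test(core)
        if failing:
            core.disabled = True        # 'repair' by taking the resource out of service
            report[index] = failing
        else:
            core.ace_set(saved)         # healthy core: restore state and resume
    return report


cores = [ToyCore(), ToyCore(stuck_bit=3, stuck_value=1)]
report = periodic_check(cores)
```

In this model the second core is reported with bit 3 failing and is marked disabled, while the first core resumes with its state restored. Real directed tests would target the actual microarchitectural structures and fault models, which this sketch does not attempt to represent.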