Techniques to mitigate the effects of congenital faults in processors

  • Authors:
  • Josep Torrellas;Smruti R. Sarangi

  • Affiliations:
  • University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign

  • Venue:
  • Techniques to mitigate the effects of congenital faults in processors
  • Year:
  • 2007

Abstract

It is becoming increasingly difficult to verify processors and to guarantee their subsequent reliable operation. Processor complexity grows rapidly with every new generation, increasing the number of design defects, i.e., logical bugs in the RTL. At the same time, process variation makes timing closure harder to ensure with each generation, increasing susceptibility to timing faults. In this thesis, we characterize such congenital faults and propose solutions that mitigate their effects. For RTL bugs, the problem is compounded by the fact that some bugs are not reproducible: several sources of non-determinism in the system, such as interrupts, I/O, variable bus latencies, and memory refresh, make failures caused by design defects increasingly hard to reproduce. We propose a mechanism, CADRE, that eliminates these sources of non-determinism with negligible performance overhead and modest storage overhead. Nonetheless, some bugs will still slip through into production versions. To recover from such bugs, we propose Phoenix, dynamic on-chip reconfigurable logic that can download bug signatures and then detect and recover from the corresponding bugs. Phoenix has 0.05% area overhead and 0.48% wiring overhead. To address process variation, we propose a model of how parameter variation affects timing errors. The model successfully predicts the probability of timing errors under different process and environmental conditions for both SRAM and logic units. We verify this model with experimental data obtained in prior work. Using this model, we introduce a novel framework that shows how microarchitecture techniques can mitigate variation-induced errors and even trade them off for power and processor frequency. Several such techniques are analyzed, in particular a high-dimensional dynamic-adaptation technique that maximizes performance when there is slack in variation-induced error rate and power.
The results show that our best combination of techniques increases processor frequency by 61% on average, allowing the processor to cycle 26% faster than without variation. Processor performance increases by 44% on average, resulting in performance that is 18% higher than without variation, at only a 12.5% area cost.
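
The two pairs of percentages in the results can be related with a little arithmetic. The sketch below assumes that the 61% and 44% gains are measured against a variation-afflicted baseline processor, and that the 26% and 18% figures are measured against a hypothetical processor without variation. The abstract does not state the baseline explicitly, and the function name is illustrative only. Because the reported numbers are averages, the derived ratios are indicative rather than exact.

```python
def implied_baseline_ratio(gain_over_baseline: float, gain_over_no_variation: float) -> float:
    """Ratio of the variation-afflicted baseline to the no-variation design.

    Assumption (not stated in the abstract): both gains describe the same
    final design, so final = baseline * (1 + gain_over_baseline)
                           = no_variation * (1 + gain_over_no_variation).
    """
    return (1.0 + gain_over_no_variation) / (1.0 + gain_over_baseline)


# Frequency: +61% over the baseline, +26% over the no-variation design.
freq_ratio = implied_baseline_ratio(0.61, 0.26)

# Performance: +44% over the baseline, +18% over the no-variation design.
perf_ratio = implied_baseline_ratio(0.44, 0.18)

print(f"baseline frequency   / no-variation frequency   = {freq_ratio:.2f}")
print(f"baseline performance / no-variation performance = {perf_ratio:.2f}")
```

Under these assumptions, the baseline runs at roughly 78% of the no-variation frequency and delivers roughly 82% of the no-variation performance, which is the loss that the proposed techniques recover and then exceed.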