BlackJack: Hard Error Detection with Redundant Threads on SMT

  • Authors: Ethan Schuchman; T. N. Vijaykumar

  • Affiliations: Purdue University, USA; Purdue University, USA

  • Venue: DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

  • Year: 2007

Abstract

Testing is a difficult process that becomes more difficult with scaling. With smaller and faster devices, tolerance for errors shrinks, and devices may behave correctly under certain conditions and not under others. As a result, hard errors may exist but be exercised only by very specific machine states and signal pathways. Targeting these errors is difficult, and creating test cases that cover all machine states and pathways is not possible. In addition, new complications during burn-in may mean that latent hard errors are not exposed in the fab and reach the customer before becoming active. To address this problem, we propose an architecture called BlackJack that detects hard errors using redundant threads running on a single SMT core. This technique provides a safety net that catches hard errors that were either latent during test or not covered by the test cases at all. Like SRT, our technique works by executing redundant copies and verifying that their resulting machine states agree. Unlike SRT, BlackJack achieves high hard-error instruction coverage by executing the redundant threads on different frontend and backend resources in the pipeline. We show that for a 15% performance penalty over SRT, BlackJack achieves 97% hard-error instruction coverage compared to SRT's 35%.
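The intuition behind running redundant copies on different resources can be illustrated with a toy model. The sketch below is not the paper's mechanism or simulator: the `Adder` class, its `stuck_mask` parameter, and the `detects_fault` helper are hypothetical names introduced only for illustration. It assumes a hard fault that forces one output bit of a single functional unit to zero, and compares the results of two copies of the same instruction.

```python
from dataclasses import dataclass


@dataclass
class Adder:
    """Toy functional unit (hypothetical); stuck_mask forces those output bits to 0."""
    stuck_mask: int = 0  # 0 means the unit is fault-free

    def add(self, a: int, b: int) -> int:
        # 32-bit add, then clear any bits that are stuck at 0.
        return ((a + b) & 0xFFFFFFFF) & ~self.stuck_mask


def detects_fault(units, lead_unit, trail_unit, a, b):
    """Run the same add on two units and report whether the results differ."""
    lead = units[lead_unit].add(a, b)
    trail = units[trail_unit].add(a, b)
    return lead != trail


# Unit 0 has a hard fault: output bit 0 is stuck at 0. Unit 1 is healthy.
units = [Adder(stuck_mask=0x1), Adder()]

# Both copies on the faulty unit: identical wrong results, so the fault is missed.
same_unit = detects_fault(units, 0, 0, 2, 3)

# Copies on different units: the results disagree, so the fault is detected.
different_units = detects_fault(units, 0, 1, 2, 3)

# The fault is exercised only by some inputs: 2 + 2 = 4 has bit 0 clear.
masked_input = detects_fault(units, 0, 1, 2, 2)

print(same_unit, different_units, masked_input)
```

In this model, comparing two copies that share the faulty unit never flags the error, while steering the copies to different units does, provided the inputs actually exercise the fault. The last case mirrors the abstract's point that a hard error may be triggered only by specific machine states.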