Application-aware diagnosis of runtime hardware faults

  • Authors: Andrea Pellegrini; Valeria Bertacco
  • Affiliations: University of Michigan; University of Michigan
  • Venue: Proceedings of the International Conference on Computer-Aided Design
  • Year: 2010

Abstract

Extreme technology scaling in silicon devices drastically affects reliability, particularly because of runtime failures induced by transistor wearout. Current online testing mechanisms test all components in a microprocessor, including hardware that has not been exercised, and thus incur high performance penalties. We propose a hybrid hardware/software online testing solution in which components that are heavily utilized by the software application are tested more thoroughly and more frequently. Our approach thus focuses on the processor units that most affect application correctness, and it achieves high coverage while incurring minimal performance overhead. We also introduce a new metric, Application-Aware Fault Coverage, which measures a test's capability to detect faults that might have corrupted the state or the output of an application. Coverage is further improved by inserting observation points into the design. By evaluating our technique on a Sun OpenSPARC T1, we show that our solution maintains high Application-Aware Fault Coverage while reducing the performance overhead of online testing by more than a factor of 2 compared to solutions that are oblivious to the application's behavior. Specifically, our solution achieves 95% fault coverage with a minimal performance overhead (1.3%) and area impact (0.4%).
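The abstract does not give a formula for Application-Aware Fault Coverage, so the following is only a minimal sketch under an assumed reading: the metric is taken to be the fraction of faults that could corrupt the application (here, faults located in units the application exercises) that the test detects. The function names, the fault identifiers, and the toy fault sets are all hypothetical and are not taken from the paper.

```python
def conventional_fault_coverage(all_faults, detected_faults):
    """Fraction of all modeled faults that the test detects."""
    all_faults = set(all_faults)
    if not all_faults:
        return 1.0  # no faults modeled: coverage is vacuously complete
    return len(all_faults & set(detected_faults)) / len(all_faults)


def application_aware_fault_coverage(relevant_faults, detected_faults):
    """Assumed definition: fraction of application-relevant faults detected.

    `relevant_faults` are the faults that could corrupt the application's
    state or output, e.g. faults in units the application actually uses.
    """
    relevant = set(relevant_faults)
    if not relevant:
        return 1.0  # nothing can corrupt the application
    return len(relevant & set(detected_faults)) / len(relevant)


# Hypothetical fault list: two ALU faults, two FPU faults, one LSU fault.
all_faults = {"alu.0", "alu.1", "fpu.0", "fpu.1", "lsu.0"}

# Assume the application never uses the FPU, so FPU faults cannot affect it.
relevant = {f for f in all_faults if not f.startswith("fpu.")}

# Assume a test that targets the ALU and LSU only.
detected = {"alu.0", "alu.1", "lsu.0"}

conventional = conventional_fault_coverage(all_faults, detected)
app_aware = application_aware_fault_coverage(relevant, detected)
print(f"conventional: {conventional:.2f}, application-aware: {app_aware:.2f}")
```

In this toy setup the test skips the unused FPU entirely, so its conventional coverage is lower than its application-aware coverage. This illustrates why a test focused on exercised units can remain effective for the running application at lower cost.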