Application-aware diagnosis of runtime hardware faults

Authors:
Andrea Pellegrini;Valeria Bertacco
Affiliations:
University of Michigan;University of Michigan
Venue:
Proceedings of the International Conference on Computer-Aided Design
Year:
2010

Abstract

Extreme technology scaling in silicon devices drastically affects reliability, particularly because of runtime failures induced by transistor wearout. Current online testing mechanisms test all components in a microprocessor, including hardware that has not been exercised, and thus incur high performance penalties. We propose a hybrid hardware/software online testing solution in which components that are heavily utilized by the software application are tested more thoroughly and more frequently. Our approach thus focuses on the processor units that most affect application correctness, achieving high coverage while incurring minimal performance overhead. We also introduce a new metric, Application-Aware Fault Coverage, which measures a test's capability to detect faults that might have corrupted the state or the output of an application. Inserting observation points further improves the coverage of the testing system. Evaluating our technique on a Sun OpenSPARC T1, we show that our solution maintains high Application-Aware Fault Coverage while reducing the performance overhead of online testing by more than a factor of 2 compared to solutions that are oblivious to the application's behavior. Specifically, our solution achieves 95% fault coverage with minimal performance overhead (1.3%) and area impact (0.4%).
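
The abstract does not give a formula for Application-Aware Fault Coverage. As a rough illustration only, the sketch below assumes it can be read as the fraction of application-corrupting faults that a test detects. The function name, the fault identifiers, and the set-based representation are all hypothetical and are not taken from the paper.

```python
def application_aware_fault_coverage(corrupting_faults, detected_faults):
    """Fraction of application-corrupting faults that the test detects.

    Assumption (not from the paper): faults are plain identifiers, and
    `corrupting_faults` holds those that could corrupt the state or the
    output of the application under consideration.
    """
    corrupting = set(corrupting_faults)
    if not corrupting:
        # No fault can affect the application, so nothing can be missed.
        return 1.0
    detected = corrupting & set(detected_faults)
    return len(detected) / len(corrupting)


# Hypothetical fault identifiers, for illustration only.
corrupting = {"alu.f1", "alu.f2", "lsu.f1", "fpu.f3"}
detected = {"alu.f1", "lsu.f1", "fpu.f3", "mul.f9"}
coverage = application_aware_fault_coverage(corrupting, detected)
```

Under this reading, a detected fault in a unit the application never uses (here the hypothetical `mul.f9`) does not raise the metric, which is consistent with the abstract's emphasis on the units that affect application correctness.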