Testing is a difficult process that becomes more difficult with scaling. With smaller and faster devices, tolerance for errors shrinks and devices may act correctly under certain conditions and not under others. As such, hard errors may exist but be exercised only by very specific machine states and signal pathways. Targeting these errors is difficult, and creating test cases that cover all machine states and pathways is not possible. In addition, new complications during burn-in may mean that latent hard errors are not exposed in the fab and reach the customer before becoming active. To address this problem, we propose an architecture we call BlackJack that allows hard errors to be detected using redundant threads running on a single SMT core. This technique provides a safety net that catches hard errors that were either latent during test or not covered by the test cases at all. Like SRT, our technique works by executing redundant copies and verifying that their resulting machine states agree. Unlike SRT, BlackJack achieves high hard-error instruction coverage by executing redundant threads on different front-end and back-end resources in the pipeline. We show that for a 15% performance penalty over SRT, BlackJack achieves 97% hard-error instruction coverage compared to SRT's 35%.
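The reason resource diversity matters for redundant execution can be illustrated with a toy model. The sketch below is not the BlackJack mechanism itself: the `Unit` class, the stuck-at-1 fault model, and the `detected` helper are all hypothetical names introduced here for illustration. It only shows that two redundant copies routed through the same faulty resource produce the same wrong result and therefore agree, while copies routed through different resources can disagree and expose the fault.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A hypothetical functional unit (illustrative, not from the paper)."""
    uid: int
    faulty: bool = False

    def add(self, a: int, b: int) -> int:
        result = a + b
        # Assumed fault model: a hard error is a stuck-at-1 fault
        # on the lowest bit of the result.
        return result | 1 if self.faulty else result


def detected(leading: Unit, trailing: Unit, a: int, b: int) -> bool:
    """Run both redundant copies and report whether their results disagree."""
    return leading.add(a, b) != trailing.add(a, b)


good, bad = Unit(0), Unit(1, faulty=True)

# Both copies on the same faulty unit: identical wrong results, no mismatch.
same_unit = detected(bad, bad, 2, 2)

# Copies on different units: the fault-free copy exposes the mismatch.
different_units = detected(bad, good, 2, 2)

# Operands whose sum already has the low bit set do not exercise the fault,
# so even diverse execution sees no mismatch for this input.
not_exercised = detected(bad, good, 2, 1)
```

In this toy model, `same_unit` is false and `different_units` is true, mirroring the coverage gap the abstract describes between running redundant threads on shared versus different pipeline resources. The `not_exercised` case shows a fault that stays hidden for some inputs, in line with the observation that hard errors may be exercised only by specific machine states.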