As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common. Such defects are bound to hinder the correct operation of future processor systems, unless new online techniques become available to detect and to tolerate them while preserving the integrity of software applications running on the system. This paper proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called Access-Control Extension (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22% of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5%. Based on a detailed RTL-level implementation of our technique, we find its area overhead to be quite modest, with only a 5.8% increase in total chip area.
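The detect, diagnose, and reconfigure loop described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's implementation: the `Unit` class, the stuck-at-0 fault model, the 8-bit adder, the test vectors, and the function names are all hypothetical, and real ACE instructions operate on actual microprocessor state rather than on Python objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    """Hypothetical hardware unit: an 8-bit adder with an optional stuck-at-0 output bit."""
    name: str
    stuck_at_zero_bit: Optional[int] = None  # None models a defect-free unit
    enabled: bool = True

    def add(self, a: int, b: int) -> int:
        result = (a + b) & 0xFF
        if self.stuck_at_zero_bit is not None:
            # Force the defective output bit to 0, as a stuck-at-0 fault would.
            result &= ~(1 << self.stuck_at_zero_bit) & 0xFF
        return result


# Assumed directed test vectors: (a, b, expected sum). The first two together
# drive every output bit to 1 at least once, so any single stuck-at-0 bit is exposed.
TEST_VECTORS = [(0x55, 0x00, 0x55), (0xAA, 0x00, 0xAA), (0xFF, 0x01, 0x00)]


def run_directed_tests(unit: Unit) -> bool:
    """Return True if the unit produces the expected result for every test vector."""
    return all(unit.add(a, b) == expected for a, b, expected in TEST_VECTORS)


def periodic_test_epoch(units: List[Unit]) -> List[str]:
    """Model one test epoch: test each enabled unit and disable the ones that fail.

    Disabling a unit stands in for repair through resource reconfiguration.
    Returns the names of the units found faulty in this epoch.
    """
    faulty = []
    for unit in units:
        if unit.enabled and not run_directed_tests(unit):
            unit.enabled = False
            faulty.append(unit.name)
    return faulty


units = [Unit("alu0"), Unit("alu1", stuck_at_zero_bit=3)]
print(periodic_test_epoch(units))  # ['alu1']
```

In this model the faulty unit is located by name and taken out of service, while the healthy unit stays enabled. Because the test vectors live in software, they could be changed without touching the modeled hardware, which mirrors the flexibility argument made in the abstract.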