DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Exploiting Microarchitectural Redundancy For Defect Tolerance
ICCD '03 Proceedings of the 21st International Conference on Computer Design
Tolerating Hard Faults in Microprocessor Array Structures
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Software Engineering "Best Practices")
Rescue: A Microarchitecture for Testability and Defect Tolerance
Proceedings of the 32nd annual international symposium on Computer Architecture
Exploiting Structural Duplication for Lifetime Reliability Enhancement
Proceedings of the 32nd annual international symposium on Computer Architecture
Online diagnosis of hard faults in microprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Architectural core salvaging in a multi-core processor for hard-error tolerance
Proceedings of the 36th annual international symposium on Computer architecture
Thread Relocation: A Runtime Architecture for Tolerating Hard Errors in Chip Multiprocessors
IEEE Transactions on Computers
StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs
IEEE Transactions on Computers
Hi-index | 0.01 |
In the nanometer era, there has been a steady decline in the semiconductor chip manufacturing yield due to various contributing factors, such as wearout and defects due to complex processes. One of the strategies to alleviate this issue is to recover and use faulty hardware at gracefully degraded performance. A common, though naive, recovery strategy followed in the context of general purpose multicore systems is to disable the cores with faults and use only the fully functional cores. Such a coarse-granular solution is suboptimal, as the disabled cores would have many working modules which go un-utilized. The Resurrecting Operating SYstem (ROSY) presented in this paper is a step towards the development of an operating system that can work on faulty cores by adapting itself to hardware faults using software workarounds, and and utilize their working components. We consider many realistic fault models and present software workarounds for them. We have developed a framework which can be trivially plugged into a fullyfeatured x86 based OS kernel to demonstrate the feasibility of the proposed ideas. Performance evaluation using SPEC benchmarks and real-world applications show that the performance degradation of the depleted cores executing ROSY is on an average between 1.6x to 4x, depending on the fault type.