ROSY: recovering processor and memory systems from hard errors

Authors:
Karthik Raghavan;V. Kamakoti
Affiliations:
Indian Institute of Technology Madras, Chennai, India;Indian Institute of Technology Madras, Chennai, India
Venue:
ACM SIGOPS Operating Systems Review
Year:
2012

Abstract

In the nanometer era, semiconductor chip manufacturing yield has declined steadily due to factors such as wearout and defects introduced by complex processes. One strategy to alleviate this issue is to recover faulty hardware and use it at gracefully degraded performance. A common, though naive, recovery strategy in general-purpose multicore systems is to disable the faulty cores and use only the fully functional ones. Such a coarse-grained solution is suboptimal, because the disabled cores contain many working modules that go unused. The Resurrecting Operating SYstem (ROSY) presented in this paper is a step towards an operating system that can run on faulty cores by adapting itself to hardware faults through software workarounds, thereby utilizing the cores' working components. We consider many realistic fault models and present software workarounds for them. We have developed a framework that can be trivially plugged into a fully featured x86-based OS kernel to demonstrate the feasibility of the proposed ideas. Performance evaluation using SPEC benchmarks and real-world applications shows that the performance degradation of the depleted cores executing ROSY is, on average, between 1.6x and 4x, depending on the fault type.