ROSY: recovering processor and memory systems from hard errors

  • Authors:
  • Karthik Raghavan;V. Kamakoti

  • Affiliations:
  • Indian Institute of Technology Madras, Chennai, India;Indian Institute of Technology Madras, Chennai, India

  • Venue:
  • ACM SIGOPS Operating Systems Review
  • Year:
  • 2012

Quantified Score

Hi-index 0.01

Abstract

In the nanometer era, there has been a steady decline in semiconductor chip manufacturing yield due to various contributing factors, such as wearout and defects due to complex processes. One of the strategies to alleviate this issue is to recover and use faulty hardware at gracefully degraded performance. A common, though naive, recovery strategy followed in the context of general-purpose multicore systems is to disable the cores with faults and use only the fully functional cores. Such a coarse-grained solution is suboptimal, as the disabled cores would have many working modules that go unutilized. The Resurrecting Operating SYstem (ROSY) presented in this paper is a step towards the development of an operating system that can work on faulty cores by adapting itself to hardware faults using software workarounds, and utilize their working components. We consider many realistic fault models and present software workarounds for them. We have developed a framework which can be trivially plugged into a fully featured x86-based OS kernel to demonstrate the feasibility of the proposed ideas. Performance evaluation using SPEC benchmarks and real-world applications shows that the performance degradation of the depleted cores executing ROSY is on average between 1.6x and 4x, depending on the fault type.