Architectural core salvaging in a multi-core processor for hard-error tolerance

  • Authors:
  • Michael D. Powell;Arijit Biswas;Shantanu Gupta;Shubhendu S. Mukherjee

  • Affiliations:
  • Intel Massachusetts, Hudson, MA, USA;Intel Massachusetts, Hudson, MA, USA;University of Michigan, Ann Arbor, MI, USA;Intel Massachusetts, Hudson, MA, USA

  • Venue:
  • Proceedings of the 36th annual international symposium on Computer architecture
  • Year:
  • 2009

Quantified Score

Hi-index 0.01

Visualization

Abstract

The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die--assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.