Architectural core salvaging in a multi-core processor for hard-error tolerance

Authors:
Michael D. Powell; Arijit Biswas; Shantanu Gupta; Shubhendu S. Mukherjee
Affiliations:
Intel Massachusetts, Hudson, MA, USA; Intel Massachusetts, Hudson, MA, USA; University of Michigan, Ann Arbor, MI, USA; Intel Massachusetts, Hudson, MA, USA
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009



Abstract

The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults and can be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die, assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
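
The migration idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism: the `Core` class, the `broken_ops` field, the operation names such as `fp_div`, and the "first capable core" selection policy are all hypothetical and chosen for illustration. The model treats a die as ISA compliant when every operation can be executed by at least one core, and moves a thread only when its current core cannot execute the next operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    core_id: int
    # Hypothetical: operation classes this core cannot execute due to a hard error.
    broken_ops: frozenset = frozenset()

    def can_execute(self, op: str) -> bool:
        return op not in self.broken_ops


def die_is_isa_compliant(cores, isa_ops) -> bool:
    """True if every operation in the ISA is executable on at least one core."""
    return all(any(c.can_execute(op) for c in cores) for op in isa_ops)


def run_thread(ops, cores, start_core=0):
    """Run ops in order, migrating when the current core cannot execute one.

    Returns (trace, migrations), where trace lists (op, core_id) pairs.
    Assumed policy: migrate to the first core that supports the operation.
    """
    current = cores[start_core]
    trace = []
    migrations = 0
    for op in ops:
        if not current.can_execute(op):
            target = next((c for c in cores if c.can_execute(op)), None)
            if target is None:
                raise RuntimeError(f"no core on the die can execute {op!r}")
            current = target
            migrations += 1
        trace.append((op, current.core_id))
    return trace, migrations


# Core 0 has a (hypothetical) faulty FP divider; core 1 is fault-free.
cores = [Core(0, frozenset({"fp_div"})), Core(1)]
trace, migrations = run_thread(["add", "fp_div", "add"], cores)
```

In this toy run the thread starts on core 0, moves to core 1 when it reaches `fp_div`, and stays there afterwards. The model does not attempt to capture migration cost or any policy for returning the thread to its original core, so it says nothing about the performance results reported in the paper.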