DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Asim: A Performance Model Framework
Computer
Reconsidering Complex Branch Predictors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Exploiting Microarchitectural Redundancy For Defect Tolerance
ICCD '03 Proceedings of the 21st International Conference on Computer Design
Tolerating Hard Faults in Microprocessor Array Structures
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Rescue: A Microarchitecture for Testability and Defect Tolerance
Proceedings of the 32nd annual international symposium on Computer Architecture
Exploiting Structural Duplication for Lifetime Reliability Enhancement
Proceedings of the 32nd annual international symposium on Computer Architecture
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
BlackJack: Hard Error Detection with Redundant Threads on SMT
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Necromancer: enhancing system throughput by animating dead cores
Proceedings of the 37th annual international symposium on Computer architecture
Relax: an architectural framework for software recovery of hardware faults
Proceedings of the 37th annual international symposium on Computer architecture
Design techniques for cross-layer resilience
Proceedings of the Conference on Design, Automation and Test in Europe
Multiplexed redundant execution: a technique for efficient fault tolerance in chip multiprocessors
Proceedings of the Conference on Design, Automation and Test in Europe
Erasing Core Boundaries for Robust and Configurable Performance
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Exploring circuit timing-aware language and compilation
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
An efficient, dynamically adaptive method to tolerate transient faults in multi-core systems
EWDC '11 Proceedings of the 13th European Workshop on Dependable Computing
Deadlock-free fine-grained thread migration
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
ROSY: recovering processor and memory systems from hard errors
ACM SIGOPS Operating Systems Review
Reliable computing with ultra-reduced instruction set co-processors
Proceedings of the 49th Annual Design Automation Conference
Viper: virtual pipelines for enhanced reliability
Proceedings of the 39th Annual International Symposium on Computer Architecture
Heuristic search for adaptive, defect-tolerant multiprocessor arrays
ACM Transactions on Embedded Computing Systems (TECS) - Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems
Deconfigurable microprocessor architectures for silicon debug acceleration
Proceedings of the 40th Annual International Symposium on Computer Architecture
Exploiting program-level masking and error propagation for constrained reliability optimization
Proceedings of the 50th Annual Design Automation Conference
A block-asynchronous relaxation method for graphics processing units
Journal of Parallel and Distributed Computing
Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Use it or lose it: wear-out and lifetime in future chip multiprocessors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
DFTS: A dynamic fault-tolerant scheduling for real-time tasks in multicore processors
Microprocessors & Microsystems
Hi-index | 0.01 |
The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die--assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.