The Hyeti Defect Tolerant Microprocessor: A Practical Experiment and its Cost-Effectiveness Analysis
IEEE Transactions on Computers
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Improved Yield Models for Fault-Tolerant Memory Chips
IEEE Transactions on Computers
Exploiting Microarchitectural Redundancy For Defect Tolerance
ICCD '03 Proceedings of the 21st International Conference on Computer Design
Proceedings of the conference on Design, automation and test in Europe - Volume 1
Opportunities and challenges in application-tuned circuits and architectures based on nanodevices
Proceedings of the 1st conference on Computing frontiers
Tolerating Hard Faults in Microprocessor Array Structures
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
The Impact of Technology Scaling on Lifetime Reliability
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
An effective DFM strategy requires accurate process and IP pre-characterization
Proceedings of the 42nd annual Design Automation Conference
Exploiting Structural Duplication for Lifetime Reliability Enhancement
Proceedings of the 32nd annual international symposium on Computer Architecture
A Reconfiguration-Based Defect-Tolerant Design Paradigm for Nanotechnologies
IEEE Design & Test
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Ultra low-cost defect protection for microprocessor pipelines
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
POWER4 system microarchitecture
IBM Journal of Research and Development
Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology
IBM Journal of Research and Development
Reliability limits for the gate insulator in CMOS technology
IBM Journal of Research and Development
Reliability-aware core partitioning in chip multiprocessors
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.00 |
An increasing number of hardware failures can be attributed to device reliability problems that cause partial system failure or shutdown. In this paper we propose a scheme for improving reliability of a homogeneous chip multiprocessor (CMP) that also serves to improve manufacturing yield. Our solution centers on exploiting the natural redundancy that already exists in multi-core systems by using services from other cores for functional units that are defective in a faulty core. A micro-architectural modification allows a core on a CMP to use another core as a coprocessor to service any instruction that the former cannot execute correctly. This service is accessed to improve yield and reliability, but at the cost of some loss of performance. In order to quantify this loss we have used a cycle-accurate simulator to simulate the performance of a dual-core system with one or two cores sustaining partial failure. Our results indicate that when a large and sparingly-used unit such as a floating point arithmetic unit fails in a core, even for a floating point intensive benchmark, we can continue to run each faulty core with help from companion cores with as little as 10% impact to performance and less than 1% area overhead.