Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can cause bursts of frequent faults that last from several cycles to several seconds or more. Due to practical limitations of circuit techniques, cost-effective reliability will likely require the ability to temporarily suspend execution on a core during periods of intermittent faults. We investigate three of the most obvious techniques for adapting to the dynamically changing resource availability caused by intermittent faults, and demonstrate their different system-level implications. We show that system software reconfiguration has very high overhead, that temporarily pausing execution on a faulty core can lead to cascading livelock, and that using spare cores has a high fault-free cost. To remedy these and other drawbacks of the three baseline techniques, we propose using a thin hardware/firmware layer to manage an overcommitted system -- one where the OS is configured to use more virtual processors than the number of currently available physical cores. We show that this proposed technique can gracefully degrade performance during intermittent faults of varying duration with low overhead, without involving system software, and without requiring spare cores.
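The overcommitment idea can be illustrated with a toy model. The sketch below is not the paper's mechanism; it only assumes a layer beneath the OS that time-multiplexes a fixed set of virtual processors over whichever physical cores are usable in each time slice. The function name `schedule`, the round-robin policy, and the slice-based availability table are all hypothetical choices made for illustration.

```python
from collections import deque


def schedule(num_vcpus, core_available):
    """Round-robin virtual CPUs over the cores usable in each time slice.

    core_available: one list of booleans per time slice (True = core usable,
    False = core suspended because of an intermittent fault). This table is a
    hypothetical stand-in for whatever fault detection the hardware provides.
    Returns one dict per slice mapping core index -> virtual CPU id.
    """
    ready = deque(range(num_vcpus))  # virtual CPUs waiting for a core
    timeline = []
    for slice_cores in core_available:
        assignment = {}
        for core, usable in enumerate(slice_cores):
            if usable and ready:
                assignment[core] = ready.popleft()
        # Virtual CPUs that just ran rejoin the back of the queue, so the
        # ones left waiting are first in line for the next slice.
        ready.extend(assignment.values())
        timeline.append(assignment)
    return timeline


# Four virtual CPUs on four cores; core 2 is suspended during slices 1 and 2.
availability = [
    [True, True, True, True],
    [True, True, False, True],
    [True, True, False, True],
    [True, True, True, True],
]
timeline = schedule(4, availability)
for t, assignment in enumerate(timeline):
    print(t, assignment)
```

In this toy run the OS-visible processor count never changes: while core 2 is suspended, the four virtual CPUs share three cores and each still makes progress, which is the graceful degradation the abstract describes, without a spare core and without reconfiguring system software.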