Adapting to intermittent faults in multicore systems

Authors:
Philip M. Wells; Koushik Chakraborty; Gurindar S. Sohi
Affiliations:
University of Wisconsin-Madison, Madison, WI; University of Wisconsin-Madison, Madison, WI; University of Wisconsin-Madison, Madison, WI
Venue:
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Year:
2008


Abstract

Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can produce bursts of frequent faults that last from several cycles to several seconds or more. Due to practical limitations of circuit techniques, cost-effective reliability will likely require the ability to temporarily suspend execution on a core during periods of intermittent faults. We investigate three of the most obvious techniques for adapting to the dynamically changing resource availability caused by intermittent faults, and demonstrate their different system-level implications. We show that system software reconfiguration has very high overhead, that temporarily pausing execution on a faulty core can lead to cascading livelock, and that using spare cores has a high fault-free cost. To remedy these and other drawbacks of the three baseline techniques, we propose using a thin hardware/firmware layer to manage an overcommitted system, that is, one where the OS is configured to use more virtual processors than the number of currently available physical cores. We show that this proposed technique can gracefully degrade performance during intermittent faults of varying duration with low overhead, without involving system software, and without requiring spare cores.
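
The idea of overcommitting virtual processors onto whichever physical cores are currently fault-free can be illustrated with a toy model. The sketch below is not the paper's mechanism: the function name `simulate`, the fixed time slices, the round-robin policy, and the `fault_windows` format are all assumptions made for illustration. It only shows why every virtual processor keeps making progress, at a reduced rate, while a core is suspended.

```python
from collections import deque


def simulate(num_vcpus, num_cores, fault_windows, num_slices):
    """Toy model (hypothetical, not the paper's design).

    Time-slices `num_vcpus` virtual CPUs over the physical cores that are
    fault-free in each slice. `fault_windows` maps a core id to a
    (start, end) slice range, end exclusive, during which that core is
    suspended. Returns the number of slices each VCPU executed.
    """
    run_queue = deque(range(num_vcpus))  # round-robin order of VCPUs
    progress = [0] * num_vcpus

    for t in range(num_slices):
        # Cores that are not inside a fault window during slice t.
        available = [
            core for core in range(num_cores)
            if not (core in fault_windows
                    and fault_windows[core][0] <= t < fault_windows[core][1])
        ]
        # Run as many VCPUs as there are healthy cores in this slice.
        count = min(len(available), len(run_queue))
        scheduled = [run_queue.popleft() for _ in range(count)]
        for vcpu in scheduled:
            progress[vcpu] += 1
        # VCPUs that just ran go to the back of the queue.
        run_queue.extend(scheduled)

    return progress


if __name__ == "__main__":
    # Four VCPUs on four cores; core 0 is suspended for all 8 slices.
    print(simulate(4, 4, {0: (0, 8)}, 8))
```

In this model, with four virtual processors on four cores and one core suspended for an entire run of 8 slices, each virtual processor executes 6 slices instead of 8. No virtual processor stalls for the full duration of the fault, which is the graceful-degradation property the abstract describes; a real hardware/firmware layer would also have to handle issues such as spin detection and state migration that this sketch ignores.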