Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems

Authors:
Cristiana Bolchini;Matteo Carminati;Antonio Miele
Affiliations:
Dipartimento di Elettronica, Informatica e Bioingegneria, Politecnico di Milano, Milano, Italy 20133 (all authors)
Venue:
Journal of Electronic Testing: Theory and Applications
Year:
2013


Abstract

This paper presents a novel approach to designing multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating-system level that provides fault detection, tolerance, and diagnosis by means of thread replication and re-execution. At run time the layer adapts to changes in the working scenario, applying whichever hardening mechanism best achieves the desired trade-off between reliability and performance. The strategy has been applied in a set of experimental sessions on a real-world parallel application to evaluate its benefits on the final system with respect to various strategies selected at design time.
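
As an illustration only, the following minimal Python sketch shows how thread replication and re-execution can in principle provide fault detection and fault tolerance, and how a run-time policy might switch between the two. It is not the authors' layer: every function name, the fault-rate input, and the threshold value are hypothetical, and results are assumed to be hashable and comparable with `==`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_replicated(task, arg, replicas, pool):
    """Run `replicas` copies of task(arg) on the pool and collect every result."""
    futures = [pool.submit(task, arg) for _ in range(replicas)]
    return [f.result() for f in futures]


def duplicate_with_reexecution(task, arg, pool, max_attempts=3):
    """Detect a fault by comparing two replicas; re-execute both on mismatch."""
    for _ in range(max_attempts):
        first, second = run_replicated(task, arg, 2, pool)
        if first == second:
            return first
    raise RuntimeError("replicas disagreed on every attempt")


def triplicate_with_voting(task, arg, pool):
    """Tolerate one faulty replica by majority vote over three copies."""
    results = run_replicated(task, arg, 3, pool)
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


def choose_mechanism(observed_fault_rate, threshold=0.1):
    """Hypothetical policy: pay for a third replica only when faults are frequent.

    Both the fault-rate metric and the threshold are assumptions made
    for this sketch, not values taken from the paper.
    """
    if observed_fault_rate > threshold:
        return triplicate_with_voting
    return duplicate_with_reexecution
```

A caller would create a `ThreadPoolExecutor`, pick a mechanism with `choose_mechanism(rate)`, and invoke it as `mechanism(task, arg, pool)`. Duplication costs fewer cores when no fault occurs but pays a re-execution penalty on a mismatch, while triplication masks a single fault without re-running, which is the kind of reliability/performance trade-off the abstract refers to.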