Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The Vision of Autonomic Computing
Computer
The Reliability of FPGA Circuit Designs in the Presence of Radiation Induced Configuration Upsets
FCCM '03 Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Configurable isolation: building high availability systems with commodity multi-core processors
Proceedings of the 34th annual international symposium on Computer architecture
Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Mixed-mode multicore reliability
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Self-adaptive software: Landscape and research challenges
ACM Transactions on Autonomous and Adaptive Systems (TAAS)
The multikernel: a new OS architecture for scalable multicore systems
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Performance modeling of parallel applications on MPSoCs
SOC'09 Proceedings of the 11th international conference on System-on-chip
Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
A Fault Detection and Recovery Architecture for a Teradevice Dataflow System
DFM '11 Proceedings of the 2011 First Workshop on Data-Flow Execution Models for Extreme Scale Computing
System Adaptivity and Fault-Tolerance in NoC-based MPSoCs: The MADNESS Project Approach
DSD '12 Proceedings of the 2012 15th Euromicro Conference on Digital System Design
Hi-index | 0.00 |
This paper presents a novel approach to the design of multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating system level that achieves fault detection/tolerance/diagnosis properties by means of thread replication and re-execution mechanisms. The layer applies the most convenient hardening mechanism to achieve the desired trade-off between reliability and performance by adapting at run-time to the changes of the working scenario. The proposed strategy has been applied in a set of experimental sessions considering a real-world parallel application, to evaluate its benefits on the final system with respect to various strategies selected at design time.