Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems

  • Authors:
  • Cristiana Bolchini;Matteo Carminati;Antonio Miele

  • Affiliations:
  • Dipartimento di Elettronica, Informatica e Bioingegneria, Politecnico di Milano, Milano, Italy 20133;Dipartimento di Elettronica, Informatica e Bioingegneria, Politecnico di Milano, Milano, Italy 20133;Dipartimento di Elettronica, Informatica e Bioingegneria, Politecnico di Milano, Milano, Italy 20133

  • Venue:
  • Journal of Electronic Testing: Theory and Applications
  • Year:
  • 2013

Abstract

This paper presents a novel approach to the design of multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating system level that provides fault detection, tolerance, and diagnosis by means of thread replication and re-execution mechanisms. The layer selects the most suitable hardening mechanism to achieve the desired trade-off between reliability and performance, adapting at run time to changes in the working scenario. The proposed strategy has been evaluated in a set of experiments on a real-world parallel application, comparing its benefits on the final system against various strategies fixed at design time.
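
As a rough illustration of the mechanisms the abstract names, the following Python sketch shows thread replication with comparison, re-execution on mismatch, majority voting over three replicas, and a run-time choice between the two. It is not the authors' implementation: the function names, the retry limit, the fault-rate threshold, and the selection policy are all assumptions made for this example, and the real layer operates inside the operating system rather than in application code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_replicas(task, arg, n):
    """Run n replicas of task(arg) in separate threads and collect the results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(task, arg) for _ in range(n)]
        return [f.result() for f in futures]


def run_duplicated(task, arg, max_retries=1):
    """Duplication with comparison: a mismatch detects a fault and
    triggers re-execution (max_retries is a hypothetical parameter)."""
    for attempt in range(max_retries + 1):
        first, second = run_replicas(task, arg, 2)
        if first == second:
            return first, attempt
    raise RuntimeError("replicas still disagree after re-execution")


def run_triplicated(task, arg):
    """Triplication with majority voting: tolerates one faulty replica.
    Results must be hashable so they can be counted."""
    results = run_replicas(task, arg, 3)
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


def choose_mechanism(observed_fault_rate, threshold=0.01):
    """Hypothetical run-time policy: pay for triplication only when the
    observed fault rate exceeds a threshold, otherwise use duplication."""
    return "triplication" if observed_fault_rate > threshold else "duplication"
```

Under this policy, duplication costs two executions in the fault-free case and pays for re-execution only when a mismatch occurs, while triplication always costs three executions but masks a single fault without re-running. This is the kind of reliability versus performance trade-off that an adaptive layer would balance as conditions change.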