A multilevel fault model for integrated parallel fault-tolerant systems

Authors:
Bernhard Fechner
Affiliations:
Department of Systems and Networking, University of Augsburg, Universitätsstr. 6a, 86159, Augsburg, Germany
Venue:
Concurrency and Computation: Practice & Experience
Year:
2012

Abstract

The appearance of multithreaded, multicore, and manycore systems has led to a leap in performance. Such systems are called integrated when there are electrical and physical dependencies between their functional units, as when multiple cores are integrated on a single die. Typically, such systems have a common, shared interface to the outside world, which is a potential single point of failure. This work addresses several questions concerning fault propagation. First, if one component within a core fails, how likely is faulty behavior in other components on the same or other cores? Second, what is the overall reliability of such a system? It is important to answer these questions before implementation, because the total cost of a reliable product should be as small as possible. Our approach combines different abstraction levels in one multilevel fault model. The first level is the physical level, covering the physical effects of a fault. Validation on this level can be omitted if the modeling is precise enough. The second level is a component and routing model in which current is represented as a logic value. The last level is the behavioral modeling of components by finite state machines. Because existing parallel systems differ widely in number and nature, a theoretical approach is followed. The model can cover the whole range of parallel devices, from field programmable gate arrays to multicore CPUs and manycore graphics processing units. It can therefore help to improve the reliability of current and future parallel fault-tolerant systems by identifying the underlying bottlenecks. As an example, the model is applied to a field programmable gate array, where it identifies switchboxes as the main reliability bottleneck. Copyright © 2012 John Wiley & Sons, Ltd.