A multilevel fault model for integrated parallel fault-tolerant systems

  • Authors: Bernhard Fechner

  • Affiliations: Department of Systems and Networking, University of Augsburg, Universitätsstr. 6a, 86159, Augsburg, Germany

  • Venue: Concurrency and Computation: Practice & Experience

  • Year: 2012

Abstract

The emergence of multithreaded, multicore, and manycore systems has led to a leap in performance. Such systems are called integrated when there are electrical and physical dependencies between different functional units, as with multiple cores integrated on a single die. Typically, such systems have a common, shared interface to the outside world, which is a potential single point of failure. This work addresses several questions concerning fault propagation. First, if one component within a core fails, how likely is faulty behavior of other components on the same or other cores? Second, what is the overall reliability of such a system? It is important to answer these questions prior to implementation, because the total cost of a reliable product should be as small as possible. Our approach combines different abstraction levels in one multilevel fault model. The first level is the physical level, covering the physical effects of a fault. Validation on this level can be omitted if the modeling is precise enough. The second level is a component and routing model in which current is represented as a logic value. The last level is the behavioral modeling of components by finite state machines. Because existing parallel systems vary widely in number and nature, a theoretical approach is followed. The model can cover the whole range of parallel devices, from field programmable gate arrays to multicore CPUs and manycore graphics processing units. It can therefore help to improve the reliability of current and future parallel fault-tolerant systems by identifying the underlying bottlenecks. The model is demonstrated by applying it to a field programmable gate array, which identifies switchboxes as the main reliability bottleneck. Copyright © 2012 John Wiley & Sons, Ltd.