Autonomic fault mitigation in embedded systems

  • Authors:
  • Sandeep Neema;Ted Bapty;Shweta Shetty;Steven Nordstrom

  • Affiliations:
  • Institute for Software Integrated Systems, Vanderbilt University, 2015 Terrace Place, Nashville, TN 37235, USA (all authors)

  • Venue:
  • Engineering Applications of Artificial Intelligence
  • Year:
  • 2004

Abstract

Autonomy, particularly from a maintenance and fault-management perspective, is an increasingly desirable feature in embedded (and non-embedded) computer systems. Several factors drive this, including the increasing pervasiveness of computer systems, the cost of failures, which could be catastrophic in a wide variety of critical systems, and the increasing cost of, and strain on, the resources needed to maintain systems. A trigger system employed in real-time filtering of particle-collision data is a particularly challenging example of a class of large-scale real-time embedded systems that demand a high degree of fault resilience, owing to the large cost of operating the facilities and the potential for loss of irreplaceable data. Traditional redundancy-based approaches are not viable because the budget available for fault tolerance beyond the base system cost is limited. This paper presents an approach based on model-integrated computing that provides a set of tools for the system developer to specify, simulate, and synthesize autonomous fault-mitigative behaviors. A hierarchical, role-based organization of fault managers cleanly separates the data-processing interactions in the system from the fault-mitigative control interactions. The fault-mitigative behaviors, analogous to those of autonomous biological systems, are characterized as (1) reflex actions: highly autonomous, localized, and uncoordinated responses emanating from a single fault manager at any level of the hierarchy, and (2) healing actions: highly coordinated behavior implemented as a sequence of interactions between multiple fault managers. The strength of the approach lies in the specification of these behaviors as coordinated, interacting hierarchical concurrent finite-state machines, which makes these behaviors formally analyzable.
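
The distinction the abstract draws between reflex and healing actions can be illustrated with a minimal sketch. This is not the authors' tooling or model: the class `FaultManager`, its state names, the method names, and the command strings are all hypothetical, chosen only to show a local reflex followed by a coordinated healing sequence in a two-level hierarchy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class FaultManager:
    """Hypothetical fault manager node; every name here is illustrative."""
    name: str
    parent: Optional["FaultManager"] = field(default=None, repr=False)
    children: List["FaultManager"] = field(default_factory=list)
    state: str = "NOMINAL"  # tiny FSM: NOMINAL -> MITIGATING -> NOMINAL
    log: List[str] = field(default_factory=list)

    def add_child(self, child: "FaultManager") -> "FaultManager":
        child.parent = self
        self.children.append(child)
        return child

    def on_fault(self, fault: str) -> None:
        """Reflex action: local, uncoordinated response by this manager alone."""
        self.state = "MITIGATING"
        self.log.append(f"reflex:{fault}")
        # Assumed escalation path: report upward so a healing action can follow.
        if self.parent is not None:
            self.parent.on_report(self, fault)

    def on_report(self, source: "FaultManager", fault: str) -> None:
        """Healing action: a coordinated sequence across several managers."""
        self.state = "MITIGATING"
        self.log.append(f"heal:{fault}@{source.name}")
        for child in self.children:
            if child is not source:
                child.on_command(f"take-over:{source.name}")
        source.on_command("reset")
        self.state = "NOMINAL"

    def on_command(self, command: str) -> None:
        """Control interaction received from a manager higher in the hierarchy."""
        self.log.append(f"cmd:{command}")
        if command == "reset":
            self.state = "NOMINAL"


# Usage: one regional manager supervising two local managers.
regional = FaultManager("regional")
local_a = regional.add_child(FaultManager("local-a"))
local_b = regional.add_child(FaultManager("local-b"))
local_a.on_fault("timeout")
```

In this sketch the fault managers exchange only control messages, which mirrors the separation between data-processing and fault-mitigative interactions described above. A formally analyzable version would replace the string-valued `state` with explicit transition tables.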