Autonomic fault mitigation in embedded systems

Authors:
Sandeep Neema; Ted Bapty; Shweta Shetty; Steven Nordstrom
Affiliations:
Institute for Software Integrated Systems, Vanderbilt University, 2015 Terrace Place, Nashville, TN 37235, USA (all authors)
Venue:
Engineering Applications of Artificial Intelligence
Year:
2004

Abstract

Autonomy, particularly from a maintenance and fault-management perspective, is an increasingly desirable feature in embedded (and non-embedded) computer systems. Several factors drive this demand, including the increasing pervasiveness of computer systems, the cost of failures, which could be catastrophic in a wide variety of critical systems, and the increasing cost and strain on resources in maintaining systems. A trigger system employed in real-time filtering of particle-collision data is a particularly challenging example of a class of large-scale real-time embedded systems that demand a high degree of fault resilience, due to the large cost of operating the facilities and the potential for loss of irreplaceable data. Traditional redundancy-based approaches are not available because the fault-tolerance budget above the system cost is limited. This paper presents an approach based on model integrated computing that provides a set of tools for the system developer to specify, simulate, and synthesize autonomous fault-mitigative behaviors. A hierarchical, role-based organization of fault managers cleanly delineates the data-processing interactions in the system from the fault-mitigative control interactions. The fault-mitigative behaviors, analogous to those of autonomous biological systems, are characterized as (1) reflex actions: highly autonomous, localized, and uncoordinated responses emanating from a single fault manager at any level of the hierarchy, and (2) healing actions: highly coordinated behaviors implemented as a sequence of interactions among multiple fault managers. The strength of the approach lies in the specification of these behaviors as coordinated, interacting, hierarchical concurrent finite-state machines, which makes these behaviors formally analyzable.
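
The distinction between reflex and healing actions can be illustrated with a minimal sketch. It is not the paper's tooling or modeling language; the class `FaultManager`, its states, its method names, and the "restart" and "absorb-load" responses are all hypothetical, chosen only to show a local response in one manager followed by a coordinated sequence driven by its parent in the hierarchy.

```python
from enum import Enum, auto


class State(Enum):
    NOMINAL = auto()
    MITIGATING = auto()  # a local reflex action has been taken
    HEALING = auto()     # a coordinated healing sequence is in progress


class FaultManager:
    """Hypothetical fault manager: one small state machine per hierarchy node."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.state = State.NOMINAL
        self.log = []  # record of actions taken, for inspection

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def on_fault(self, fault):
        # Reflex action: decided locally, with no coordination.
        self.state = State.MITIGATING
        self.log.append(f"reflex:{fault}")
        # Escalate so the parent can start a healing action.
        if self.parent is not None:
            self.parent.on_child_fault(self, fault)

    def on_child_fault(self, child, fault):
        # Healing action: a sequence of interactions with several managers.
        self.state = State.HEALING
        self.log.append(f"heal-start:{child.name}:{fault}")
        for manager in self.children:
            manager.on_heal_command(failed=child.name)
        self.state = State.NOMINAL
        self.log.append("heal-done")

    def on_heal_command(self, failed):
        # The failed node restarts; its siblings take over its work.
        if self.name == failed:
            self.log.append("restart")
        else:
            self.log.append(f"absorb-load:{failed}")
        self.state = State.NOMINAL


# Usage: two workers under one regional manager (assumed topology).
regional = FaultManager("regional")
worker_a = regional.add_child(FaultManager("worker-a"))
worker_b = regional.add_child(FaultManager("worker-b"))
worker_a.on_fault("timeout")
```

Because each manager here is a small explicit state machine with a fixed set of states and transitions, the same behavior could in principle be written in a model-checker input language and analyzed, which is the property the abstract emphasizes. This sketch runs synchronously in a single thread and does not model concurrency.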