Rising software complexity makes aerospace systems very difficult to analyze, and makes it hard to prepare for all possible fault scenarios at design time; therefore, classical run-time fault tolerance techniques such as self-checking pairs and triple modular redundancy are used. However, several recent incidents have made it clear that existing software fault tolerance techniques alone are not sufficient. To improve system dependability, simpler yet formally specified and verified run-time monitoring, diagnosis, and fault mitigation capabilities are needed. Such architectures are already in use for managing the health of vehicles and systems. Software health management is the application of these techniques to software systems. In this article, we briefly describe the software health management techniques and architecture developed by our research group. The foundation of the architecture is a real-time component framework (built upon ARINC-653 platform services) that defines a model of computation for software components. Two dedicated architectural elements, the Component Level Health Manager (CLHM) and the System Level Health Manager (SLHM), provide the health management services: anomaly detection, fault source isolation, and fault mitigation. The SLHM includes a diagnosis engine that (1) uses a Timed Failure Propagation Graph (TFPG) model derived from the component assembly model, (2) reasons about cascading fault effects in the system, and (3) isolates the fault source component(s). Thereafter, the appropriate system-level mitigation action is taken. The main focus of this article is the description of the fault mitigation architecture, which uses goal-based deliberative reasoning to determine the best mitigation actions for recovering the system from the identified failure mode.
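The fault source isolation step can be illustrated with a deliberately simplified sketch. It is not the TFPG reasoner described above: timing intervals, failure modes, and alarm semantics are all omitted, and the graph, the component names, and the functions `reachable` and `isolate_sources` are hypothetical, introduced only for illustration. The sketch assumes a plain directed graph in which an edge means "a fault in this component can propagate to that one", and it reports every component whose failure could explain all observed anomalies.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolate_sources(graph, anomalies):
    """Return, sorted by name, the components whose failure could
    propagate to every component in the observed anomaly set."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    observed = set(anomalies)
    return sorted(c for c in nodes if observed <= reachable(graph, c))


# Hypothetical component assembly (an assumption, not from the source):
# sensor feeds filter; filter feeds controller and logger;
# controller feeds actuator.
GRAPH = {
    "sensor": ["filter"],
    "filter": ["controller", "logger"],
    "controller": ["actuator"],
}

# Anomalies reported at the controller and the logger.
candidates = isolate_sources(GRAPH, {"controller", "logger"})
print(candidates)  # ['filter', 'sensor']
```

In this toy assembly both `filter` and `sensor` remain candidates, because a fault in either one can reach the two components that reported anomalies. Narrowing such a candidate set further is where the timing information in a real TFPG model would come into play.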