To guarantee high availability, automation systems must be fault-tolerant. To this end, they must provide redundancy for the critical parts of the system. Classical fault tolerance patterns such as standby or N-modular redundancy keep the system operating when a fault occurs. Afterwards, however, fault tolerance is degraded or, depending on the number of deployed replicas, often lost entirely until the system has been repaired. We introduce a combination of a component-based framework, redundancy patterns, and a runtime manager that together provide fault tolerance, detect host failures, and trigger a reconfiguration of the system at runtime. This combined solution maintains system operation when a fault occurs and automatically restores fault tolerance. We validate the solution with a case study of an industrial distributed automation system. The validation shows that our solution quickly restores fault tolerance without operator intervention or immediate hardware replacement, while limiting the impact on other applications.
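The idea of a runtime manager that detects host failures and then restores redundancy can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes timeout-based heartbeat failure detection and a fixed target replica count per component. It also assumes a simple "least-loaded healthy host" placement rule as one way to limit the impact on other applications. All names (`RuntimeManager`, `heartbeat`, `deploy`, `reconfigure`, the host and component names) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RuntimeManager:
    """Hypothetical runtime manager: detects failed hosts via heartbeat
    timeouts and re-places lost replicas so each component regains its
    target replica count (i.e. fault tolerance is restored)."""
    timeout: float  # seconds without a heartbeat before a host counts as failed
    last_seen: dict[str, float] = field(default_factory=dict)      # host -> last heartbeat
    placement: dict[str, set[str]] = field(default_factory=dict)   # component -> hosts
    target: dict[str, int] = field(default_factory=dict)           # component -> replicas

    def heartbeat(self, host: str, now: float) -> None:
        self.last_seen[host] = now

    def deploy(self, component: str, hosts: list[str], replicas: int) -> None:
        self.target[component] = replicas
        self.placement[component] = set(hosts)

    def failed_hosts(self, now: float) -> set[str]:
        return {h for h, t in self.last_seen.items() if now - t > self.timeout}

    def _load(self, host: str) -> int:
        # Number of replicas (of any component) currently placed on this host.
        return sum(host in hosts for hosts in self.placement.values())

    def reconfigure(self, now: float) -> list[tuple[str, str]]:
        """Remove failed hosts, then add replicas on healthy hosts that do not
        already run the component, least-loaded first (ties broken by name).
        Returns the (component, new_host) actions taken."""
        failed = self.failed_hosts(now)
        for h in failed:
            del self.last_seen[h]
        healthy = sorted(self.last_seen)
        actions = []
        for comp, hosts in self.placement.items():
            hosts -= failed  # in-place: drops replicas lost with failed hosts
            spares = sorted((h for h in healthy if h not in hosts), key=self._load)
            while len(hosts) < self.target[comp] and spares:
                new_host = spares.pop(0)
                hosts.add(new_host)
                actions.append((comp, new_host))
        return actions


# Usage: a duplex (2-replica) controller loses host h1 and is restored on h3.
mgr = RuntimeManager(timeout=1.0)
for h in ["h1", "h2", "h3", "h4"]:
    mgr.heartbeat(h, now=0.0)
mgr.deploy("controller", ["h1", "h2"], replicas=2)
for h in ["h2", "h3", "h4"]:  # h1 stops sending heartbeats
    mgr.heartbeat(h, now=1.5)
print(mgr.reconfigure(now=2.0))  # [('controller', 'h3')]
```

Placing the replacement replica only on a host that does not already run the component keeps replicas on distinct hosts, so a single later host failure again leaves at least one replica running. A real system would also need to transfer the component's state to the new replica, which this sketch omits.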