An important class of distributed real-time and embedded (DRE) applications consists of periodic soft real-time tasks. Timeliness and availability are essential requirements for the correct operation of these applications. Conventional approaches tend to apply non-adaptive, load-agnostic fault tolerance within a real-time system and often make ad hoc fault tolerance decisions (e.g., the choice of failover targets) that can further overload already strained resources. Potential adverse consequences of these ad hoc decisions include excessive delays for real-time tasks and cascading resource failures. This paper presents FLARe, a middleware that provides adaptive fault tolerance for DRE systems. FLARe's resource management infrastructure monitors system metrics, including CPU utilization, and makes informed, load-aware, and adaptive decisions about the application's fault tolerance configuration (e.g., failover targets and the physical placement of replicas). FLARe also employs decision-making algorithms that adapt these decisions at runtime as faults occur and that trade off timeliness, availability, and performance as resources become overloaded, are removed, or are added.
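To make the idea of load-aware failover target selection concrete, the following is a minimal sketch and not FLARe's actual algorithm, which the abstract does not specify. It assumes that each host reports a single CPU utilization figure, that the load of the failed task is known, and that a host is acceptable only if its projected utilization stays under a fixed threshold. The names `Host`, `rank_failover_targets`, `task_load`, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    cpu_utilization: float  # fraction of CPU in use, 0.0 to 1.0 (assumed metric)


def rank_failover_targets(hosts, failed_host, task_load, threshold=0.8):
    """Return the names of candidate failover hosts, least loaded first.

    A host qualifies only if absorbing task_load keeps its projected
    utilization at or below threshold. Both values are illustrative
    assumptions, not parameters taken from FLARe.
    """
    candidates = [
        h for h in hosts
        if h.name != failed_host and h.cpu_utilization + task_load <= threshold
    ]
    # Prefer the host with the most spare capacity.
    return [h.name for h in sorted(candidates, key=lambda h: h.cpu_utilization)]


hosts = [
    Host("node-a", 0.70),
    Host("node-b", 0.35),
    Host("node-c", 0.50),
    Host("node-d", 0.20),
]

# node-d has failed; its task needs about 25% of a CPU on the new host.
print(rank_failover_targets(hosts, failed_host="node-d", task_load=0.25))
```

In this sketch, a load-agnostic policy could have picked node-a and pushed it to 95% utilization, whereas the load-aware ranking excludes it. Re-running the ranking whenever utilization reports change is one simple way to keep failover targets current as resources become overloaded, are removed, or are added.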