Design Optimization of Time-and Cost-Constrained Fault-Tolerant Distributed Embedded Systems
Proceedings of the conference on Design, Automation and Test in Europe - Volume 2
Proceedings of the conference on Design, automation and test in Europe: Proceedings
CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Proceedings of the 4th on Middleware doctoral symposium
Scheduling of fault-tolerant embedded systems with soft and hard timing constraints
Proceedings of the conference on Design, automation and test in Europe
Synthesis of fault-tolerant embedded systems
Proceedings of the conference on Design, automation and test in Europe
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Analysis and optimization of fault-tolerant embedded systems with hardened processors
Proceedings of the Conference on Design, Automation and Test in Europe
Combined architecture and hardening techniques exploration for reliable embedded system design
Proceedings of the 21st edition of the great lakes symposium on Great lakes symposium on VLSI
Cost-effective safety and fault localization using distributed temporal redundancy
CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
A self-checking hardware journal for a fault-tolerant processor architecture
International Journal of Reconfigurable Computing - Special issue on selected papers from the international workshop on reconfigurable communication-centric systems on chips (ReCoSoC' 2010)
ACM Transactions on Embedded Computing Systems (TECS)
Proceedings of the Conference on Design, Automation and Test in Europe
Hi-index | 14.98 |
The time-triggered model, with tasks scheduled in static (off line) fashion, provides a high degree of timing predictability in safety-critical distributed systems. Such systems must also tolerate transient and intermittent failures which occur far more frequently than permanent ones. Software-based recovery methods using temporal redundancy, such as task reexecution and primary/backup, while incurring performance overhead, are cost-effective methods of handling these failures. We present a constructive approach to integrating runtime recovery policies in a time-triggered distributed system. Furthermore, the method provides transparent failure recovery in that a processor recovering from task failures does not disrupt the operation of other processors. Given a general task graph with precedence and timing constraints and a specific fault model, the proposed method constructs the corresponding fault-tolerant (FT) schedule with sufficient slack to accommodate recovery. We introduce the cluster-based failure recovery concept which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead. Contingency schedules, also generated offline, revise this FT schedule to mask task failures on individual processors while preserving precedence and timing constraints. We present simulation results which show that, for small-scale embedded systems having task graphs of moderate complexity, the proposed approach generates FT schedules which incur about 30-40 percent performance overhead when compared to corresponding non-fault-tolerant ones.