Cost pressure is driving vendors of safety-critical systems to integrate previously distributed systems. One natural approach, which we have previously introduced, is On-Demand Redundancy (ODR), which allows safety-critical and non-critical tasks, traditionally isolated to limit interference, to execute on shared resources. Our prior work has shown that relaxed dedication (RD), an ODR strategy that allows non-critical tasks (NCTs) to execute on idle critical task resources (CTRs), significantly increases NCT throughput. Unfortunately, there are circumstances under which, in spite of this opportunity, it is difficult to schedule NCTs effectively. In this paper, we introduce distributed temporal redundancy (DTR), which lets critical tasks, traditionally executed in lockstep, execute asynchronously. In doing so, DTR increases scheduling flexibility, resulting in systems that come much closer to the optimal NCT throughput than is possible with relaxed dedication alone; in one set of experiments, DTR schedules no less than 93% of the theoretical NCT cycles across a variety of synthetic benchmarks, outperforming RD by over 11% on average. Furthermore, distributing all redundant tasks across different resources makes it possible to achieve triple-modular redundancy, and therefore fault localization. We demonstrate that this can be accomplished with little additional cost and complexity: in practice, relatively few DTR tasks are in flight simultaneously, which limits the additional buffering needed to support DTR.