Cost-effective safety and fault localization using distributed temporal redundancy

Authors:
Brett H. Meyer;Benton H. Calhoun;John Lach;Kevin Skadron
Affiliations:
University of Virginia, Charlottesville, VA, USA;University of Virginia, Charlottesville, VA, USA;University of Virginia, Charlottesville, VA, USA;University of Virginia, Charlottesville, VA, USA
Venue:
CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
Year:
2011

Abstract

Cost pressure is driving vendors of safety-critical systems to integrate previously distributed systems. One natural approach we have previously introduced is On-Demand Redundancy (ODR), which allows safety-critical and non-critical tasks, traditionally isolated to limit interference, to execute on shared resources. Our prior work has shown that relaxed dedication (RD), an ODR strategy that allows non-critical tasks (NCTs) to execute on idle critical task resources (CTRs), significantly increases NCT throughput. Unfortunately, there are circumstances under which, in spite of this opportunity, it is difficult to schedule NCTs effectively. In this paper, we introduce distributed temporal redundancy (DTR), which allows critical tasks, which traditionally execute in lockstep, to execute asynchronously. In doing so, DTR increases scheduling flexibility, resulting in systems that come much closer to the optimal NCT throughput than with relaxed dedication alone; in one set of experiments, DTR schedules no less than 93% of the theoretical NCT cycles across a variety of synthetic benchmarks, outperforming RD by over 11%, on average. Furthermore, by distributing all redundant tasks across different resources, triple-modular redundancy, and therefore fault localization, can be achieved. We demonstrate that this can be accomplished with little additional cost and complexity: in practice, relatively few DTR tasks are in flight simultaneously, limiting the additional buffering needed to support DTR.
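
The link between triple-modular redundancy and fault localization mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `vote_and_localize`, the resource names, and the idea of buffering each replica's result in a dictionary until all replicas have finished are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple


def vote_and_localize(
    results: Dict[str, Hashable],
) -> Tuple[Optional[Hashable], List[str]]:
    """Majority-vote over buffered replica results, keyed by resource name.

    Hypothetical helper: returns (agreed_value, suspects). agreed_value is
    None when no strict majority exists; suspects lists the resources whose
    result differs from the majority (or all resources if there is none).
    """
    if not results:
        return None, []
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        # No strict majority: a fault is detected but cannot be localized.
        return None, sorted(results)
    suspects = sorted(name for name, v in results.items() if v != value)
    return value, suspects


# Three replicas that finished at different times; results were buffered.
value, suspects = vote_and_localize({"core0": 42, "core1": 42, "core2": 7})
print(value, suspects)
```

With only two replicas, a mismatch leaves no strict majority, so the fault is detected but no single resource can be blamed; a third replica on a different resource is what lets the voter name the disagreeing one.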