A light-weight cache-based fault detection and checkpointing scheme for MPSoCs enabling relaxed execution synchronization

Authors:
Chengmo Yang; Alex Orailoglu
Affiliations:
UC San Diego, San Diego, CA, USA; UC San Diego, San Diego, CA, USA
Venue:
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Year:
2008

Abstract

While technology advances have made MPSoCs a standard architecture for embedded systems, their applicability is increasingly challenged by a dramatic rise in the number of device failures that may occur during execution. Conventional fault tolerance techniques employ a duplication-and-comparison strategy to detect arbitrary execution faults and a checkpointing-and-rollback strategy to recover from a faulty state. Comparison and checkpointing are performed either at the task level, which imposes a large overhead for verifying and backing up memory pages, or at the instruction level, which requires a lock-step execution model that significantly limits attainable performance. To overcome the shortcomings of both approaches, we propose a cache-based fault tolerance scheme in which comparison and checkpointing are performed at the cache-memory interface. By allowing two processors that execute duplicated tasks to share a single data cache, the proposed scheme verifies execution results before writing them back to memory, thus protecting memory from being polluted by execution faults. This in turn significantly reduces the checkpointing overhead. Meanwhile, since only the data written to memory are compared, the strict instruction-by-instruction synchronization model used in multithreading processors can be relaxed. Simulation results confirm that the proposed scheme imposes a performance overhead of only 1.4% to 10.4% while effectively attaining both fault detection and execution checkpointing.
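
The idea of comparing results at the cache-memory interface can be illustrated with a small software model. The sketch below is not the authors' hardware design. It is a toy Python model built on several assumptions: the class name `SharedWriteCheckCache`, the per-core `pending` buffers, and the `"wait"` / `"commit"` / `"fault"` return values are all hypothetical. It shows only the principle stated in the abstract: a value reaches memory once both redundant cores have produced the same value, so memory itself always holds a verified state to roll back to, and the two cores need not run in lock-step.

```python
class SharedWriteCheckCache:
    """Toy model of two redundant cores sharing one data cache.

    Assumption: each core's unverified writes are kept apart until
    write-back, when the two copies are compared.
    """

    def __init__(self):
        self.memory = {}               # verified data only; doubles as the checkpoint
        self.pending = {0: {}, 1: {}}  # unverified writes per core: addr -> value

    def write(self, core, addr, value):
        # A store stays in the cache; memory is not touched yet.
        self.pending[core][addr] = value

    def read(self, core, addr):
        # A core sees its own unverified write first, then verified memory.
        return self.pending[core].get(addr, self.memory.get(addr, 0))

    def write_back(self, addr):
        """Try to commit addr to memory; compare both cores' copies first."""
        a = self.pending[0].get(addr)
        b = self.pending[1].get(addr)
        if a is None or b is None:
            # The other core has not produced this value yet. No lock-step
            # is needed: the line simply waits in the cache.
            return "wait"
        if a != b:
            # Mismatch: an execution fault is detected before memory is polluted.
            return "fault"
        self.memory[addr] = a
        del self.pending[0][addr]
        del self.pending[1][addr]
        return "commit"

    def rollback(self):
        # Recovery: discard all unverified data; memory already holds
        # the last verified state.
        self.pending = {0: {}, 1: {}}


if __name__ == "__main__":
    cache = SharedWriteCheckCache()
    cache.write(0, 0x10, 7)
    print(cache.write_back(0x10))  # core 1 is still behind
    cache.write(1, 0x10, 7)
    print(cache.write_back(0x10))  # both copies agree
    cache.write(0, 0x20, 1)
    cache.write(1, 0x20, 2)        # simulated fault on one core
    print(cache.write_back(0x20))
    cache.rollback()
```

In this model, recovery amounts to clearing the unverified entries, which is one way to read the abstract's claim that protecting memory from faulty writes reduces the checkpointing overhead.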