Boosting efficiency of fault detection and recovery through application-specific comparison and checkpointing

Authors:
Hao Chen;Chengmo Yang
Affiliations:
University of Delaware, Newark, DE, USA;University of Delaware, Newark, DE, USA
Venue:
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Year:
2013



Abstract

While unending technology scaling has brought reliability to the forefront of the semiconductor industry's concerns, fault tolerance techniques are still rarely incorporated into existing designs due to their high overhead. One fault tolerance scheme that has received considerable research attention is duplication and checkpointing. However, most techniques in this category employ a blind strategy when comparing instruction results, which not only incurs large overhead in buffering and verifying these values, but also induces unnecessary rollbacks to recover from faults that would never influence subsequent execution. To tackle these issues, we introduce in this paper an approach that identifies the minimum set of instruction results for fault detection and checkpointing. For a given application, the proposed technique first identifies the control and data flow information of each execution hotspot, and then selects as the comparison set only the instruction results that either influence the final program results or are needed during re-execution. Our experimental studies demonstrate that the proposed hotspot-targeting technique is able to reduce the comparison overhead by nearly 88% and mask over 38% of all injected faults, while at the same time delivering full fault coverage.
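
The selection idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, only a simplified reading of it under stated assumptions: the hotspot is a single straight-line block with no branches or memory accesses, "influences the final program results" is approximated by "is the last write to a register that is live at the hotspot exit", and "needed during re-execution" is approximated by "is read before being overwritten inside the hotspot". The names `Instr` and `select_sets` are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: tuple[str, ...]   # registers read by the instruction


def select_sets(block, live_out):
    """Return (compare, checkpoint) for one straight-line hotspot.

    compare:    indices of instructions whose result is still live at exit
                (assumption: only these can affect later execution).
    checkpoint: registers that are read before being overwritten in the
                block, so their old values are needed to re-execute it.
    """
    # Backward scan: the last definition of each live-out register.
    compare = set()
    pending = set(live_out)
    for i in range(len(block) - 1, -1, -1):
        if block[i].dest in pending:
            compare.add(i)
            pending.discard(block[i].dest)

    # Forward scan: registers read before any definition in the block.
    defined = set()
    live_in = set()
    for ins in block:
        live_in.update(s for s in ins.srcs if s not in defined)
        defined.add(ins.dest)

    # Only live-in registers that the block overwrites must be saved.
    checkpoint = live_in & defined
    return compare, checkpoint


# Hypothetical five-instruction hotspot.
block = [
    Instr("r1", ("r0", "r2")),  # 0: r1 = r0 + r2
    Instr("r3", ("r1",)),       # 1: r3 = r1 * r1
    Instr("r1", ("r3", "r0")),  # 2: r1 = r3 - r0
    Instr("r0", ("r1", "r4")),  # 3: r0 = r1 + r4
    Instr("r5", ("r2",)),       # 4: r5 = r2 + r2 (r5 is dead at exit)
]
compare, checkpoint = select_sets(block, live_out={"r0", "r1"})
```

In this toy block only the results of instructions 2 and 3 would be compared, and only `r0` would be checkpointed. A fault in instruction 4 never reaches a live value, so under these assumptions it is masked rather than triggering a rollback, while a fault in instruction 0 or 1 is still caught if it changes the compared results.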