Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation

Authors:
Hao Chen;Chengmo Yang
Affiliations:
University of Delaware, Newark, DE;University of Delaware, Newark, DE
Venue:
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Year:
2013

Abstract

Continued scaling of feature size and noise margin keeps raising hardware failure rates, which makes fault tolerance a requirement for computer systems. Redundant execution is one fault tolerance scheme that has received considerable research attention. However, existing solutions assume that the fault rate is low: they either focus solely on fault detection or even increase recovery cost in order to reduce fault detection overhead. This lack of overall efficiency makes them ill-suited to embedded systems with tight energy and cost budgets. Our study shows that checkpoint frequency and fault rate are two critical parameters determining the overall fault detection and recovery overhead. To co-optimize detection and recovery, we statically construct a mathematical model that takes application and architecture characteristics into account and identifies the optimal checkpoint frequency of an application for a given fault rate. Moreover, because the fault rate cannot be predicted a priori, we propose a set of heuristics that enable the system to dynamically monitor the fault rate and adapt the checkpoint frequency accordingly. The efficacy of the static and the adaptive optimizations is evaluated through detailed instruction-level simulation. The results show that the optimal checkpoint frequency identified by the static model is very close to the actual value (6% deviation) and that the run-time adaptation scheme effectively reduces the overhead caused by the unpredictability of the fault rate.
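
The abstract does not give the model or the heuristics themselves, so the following is only an illustrative sketch of the general idea, not the authors' method. It substitutes the classical first-order approximation of the optimum checkpoint interval, sqrt(2 * C / lambda), for the paper's application-specific model, and a simple exponentially smoothed fault-rate estimate for its heuristics. The names `first_order_interval`, `AdaptiveCheckpointer`, `smoothing`, and `min_rate` are hypothetical, and time is assumed to be measured in one consistent unit (for example, cycles).

```python
import math


def first_order_interval(checkpoint_cost: float, fault_rate: float) -> float:
    """Classical first-order optimum interval: sqrt(2 * C / lambda).

    checkpoint_cost: time to take one checkpoint (assumed unit: cycles).
    fault_rate: expected faults per unit time (same time unit).
    This is a stand-in for the paper's model, which is not given in the abstract.
    """
    if checkpoint_cost <= 0 or fault_rate <= 0:
        raise ValueError("checkpoint_cost and fault_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / fault_rate)


class AdaptiveCheckpointer:
    """Hypothetical run-time adapter: re-estimates the fault rate from
    observed faults and recomputes the checkpoint interval."""

    def __init__(self, checkpoint_cost: float, initial_fault_rate: float,
                 smoothing: float = 0.5, min_rate: float = 1e-12):
        if not 0.0 < smoothing <= 1.0:
            raise ValueError("smoothing must be in (0, 1]")
        self.checkpoint_cost = checkpoint_cost
        self.fault_rate = initial_fault_rate
        self.smoothing = smoothing
        self.min_rate = min_rate

    def interval(self) -> float:
        """Checkpoint interval for the current fault-rate estimate."""
        return first_order_interval(self.checkpoint_cost, self.fault_rate)

    def observe(self, faults: int, elapsed: float) -> float:
        """Blend the rate seen in the last window into the estimate
        and return the updated checkpoint interval."""
        if elapsed <= 0:
            raise ValueError("elapsed must be positive")
        observed = faults / elapsed
        blended = ((1.0 - self.smoothing) * self.fault_rate
                   + self.smoothing * observed)
        # Keep the estimate strictly positive so the interval stays finite.
        self.fault_rate = max(blended, self.min_rate)
        return self.interval()


if __name__ == "__main__":
    adapter = AdaptiveCheckpointer(checkpoint_cost=2.0, initial_fault_rate=0.01)
    print("initial interval:", adapter.interval())
    print("after a burst of faults:", adapter.observe(faults=3, elapsed=100.0))
```

In this sketch a burst of faults raises the estimated rate and therefore shortens the interval (checkpoints become more frequent), while a quiet window lengthens it. That matches the direction of adaptation described in the abstract, under the stated assumptions.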