An On-Line Algorithm for Checkpoint Placement
IEEE Transactions on Computers
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
A first order approximation to the optimum checkpoint interval
Communications of the ACM
A Variational Calculus Approach to Optimal Checkpoint Placement
IEEE Transactions on Computers
Transient-fault recovery using simultaneous multithreading
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Parameter variations and impact on circuits and microarchitecture
Proceedings of the 40th annual Design Automation Conference
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
The Impact of Technology Scaling on Lifetime Reliability
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Min-Max Checkpoint Placement under Incomplete Failure Information
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Energy-Aware Adaptive Checkpointing in Embedded Real-Time Systems
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Characterization of Soft Errors Caused by Single Event Upsets in CMOS Processes
IEEE Transactions on Dependable and Secure Computing
Opportunistic Transient-Fault Detection
Proceedings of the 32nd annual international symposium on Computer Architecture
MiBench: A free, commercially representative embedded benchmark suite
WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids
IEEE Transactions on Parallel and Distributed Systems
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Evaluating cooperative checkpointing for supercomputing systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Trade-offs in transient fault recovery schemes for redundant multithreaded processors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Hi-index | 0.00 |
The ever scaling-down feature size and noise margin keep elevating hardware failure rates, requiring the incorporation of fault tolerance into computer systems. One fault tolerance scheme that receives a lot of research attention is redundant execution. However, existing solutions are developed under the assumption that the fault rate is low. These techniques either solely focus on fault detection, or sometimes even increase recovery cost to reduce fault detection overhead. The lack of overall efficiency makes them insufficient and inappropriate for embedded systems with tight energy and cost budget. Our study shows that checkpoint frequency and fault rate are two critical parameters determining the overall fault detection and recovery overhead. To co-optimize detection and recovery, we statically construct a mathematical model, capable of taking application and architecture characteristics into consideration and identifying the optimal checkpoint frequency of an application for a given fault rate. Moreover, as the fault rate is infeasible to predict a priori, we furthermore propose a set of heuristics, enabling the system to dynamically monitor the fault rate and adapt the checkpoint frequency accordingly. The efficacy of the static and the adaptive optimizations is evaluated through detailed instruction-level simulation. The results show that the optimal checkpoint frequency identified by the static model is very close to the actual value (6% deviation) and the run-time adaptation scheme effectively reduces the overhead caused by the unpredictability in fault rate.