Analysis of Faults in an N-Version Software Experiment
IEEE Transactions on Software Engineering
FERRARI: A Flexible Software-Based Fault and Error Injection System
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Field testing for cosmic ray soft errors in semiconductor memories
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Multiple instruction issue in the NonStop cyclone processor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Transient-fault recovery using simultaneous multithreading
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
IBM's S/390 G5 Microprocessor Design
IEEE Micro
Framework for Testing the Fault-Tolerance of Systems Including OS and Network Aspects
HASE '01 The 6th IEEE International Symposium on High-Assurance Systems Engineering: Special Topic: Impact of Networking
A Fault Tolerant Approach to Microprocessor Design
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Experimental Evaluation of Time-redundant Execution for a Brake-by-wire Application
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Experimental Analysis of the Errors Induced into Linux by Three Fault Injection Techniques
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
CMOS Electronics: How It Works, How It Fails
CMOS Electronics: How It Works, How It Fails
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
Proceedings of the 31st annual international symposium on Computer architecture
Microarchitecture and Design Challenges for Gigascale Integration
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
SWIFT: Software Implemented Fault Tolerance
Proceedings of the international symposium on Code generation and optimization
Design and Evaluation of Hybrid Fault-Detection Systems
Proceedings of the 32nd annual international symposium on Computer Architecture
DieHard: probabilistic memory safety for unsafe languages
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Static typing for a faulty lambda calculus
Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
Exterminator: automatically correcting memory errors with high probability
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
Proceedings of the International Symposium on Code Generation and Optimization
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Exploiting concurrency vulnerabilities in system call wrappers
WOOT '07 Proceedings of the first USENIX workshop on Offensive Technologies
mSWAT: low-cost hardware fault detection and diagnosis for multicore systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Speculative parallelization using software multi-threaded transactions
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Respec: efficient online multiprocessor replayvia speculation and external determinism
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Shoestring: probabilistic soft error reliability on the cheap
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Automated software diversity for hardware fault detection
ETFA'09 Proceedings of the 14th IEEE international conference on Emerging technologies & factory automation
DAFT: decoupled acyclic fault tolerance
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Epipe: A low-cost fault-tolerance technique considering WCET constraints
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.00 |
Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.