Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
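The core idea of running redundant processes and comparing them can be illustrated with a small sketch. This is not the PLR prototype, which is described as transparent to unmodified programs; it is an application-level approximation under stated assumptions. It assumes a POSIX system (it uses `os.fork`), compares only each replica's final result, and uses hypothetical names throughout (`run_redundant`, `checksum`, `faulty_replica`). The `faulty_replica` parameter simulates a transient fault by flipping one bit in that replica's output.

```python
import os
import pickle
from collections import Counter


def run_redundant(fn, args=(), replicas=3, faulty_replica=None):
    """Run fn(*args) in `replicas` forked processes and compare the outputs.

    Returns (value, disagreeing), where `disagreeing` is the number of
    replicas whose output differed from the majority. Raises RuntimeError
    when no strict majority exists (fault detected but not recoverable).
    Hypothetical helper for illustration only; POSIX-only.
    """
    children = []
    for idx in range(replicas):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: one redundant replica
            status = 1
            try:
                os.close(r)
                payload = pickle.dumps(fn(*args))
                if idx == faulty_replica:
                    # Simulated transient fault: flip one bit of the output.
                    payload = bytes([payload[0] ^ 0x01]) + payload[1:]
                os.write(w, payload)
                status = 0
            finally:
                os._exit(status)  # never return into the parent's code path
        os.close(w)  # parent keeps only the read end
        children.append((pid, r))

    outputs = []
    for pid, r in children:
        with os.fdopen(r, "rb") as pipe_in:
            outputs.append(pipe_in.read())  # empty if the replica crashed
        os.waitpid(pid, 0)

    # Compare the serialized outputs and take a majority vote.
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= replicas // 2 or not winner:
        raise RuntimeError("replica outputs disagree; no majority")
    return pickle.loads(winner), replicas - votes


def checksum(n):
    """Hypothetical deterministic workload."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(run_redundant(checksum, (1000,)))
    print(run_redundant(checksum, (1000,), faulty_replica=1))
```

With three replicas, a single corrupted output is outvoted and the correct value is still returned; with two replicas, a mismatch can be detected but not corrected, so the sketch raises an error. Because each replica is an ordinary process, the operating system is free to schedule the replicas on different cores, which is the property the abstract relies on for low overhead.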