PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures

Authors:
Alex Shye;Joseph Blomstedt;Tipp Moseley;Vijay Janapa Reddi;Daniel A. Connors
Affiliations:
Northwestern University, Evanston;University of Colorado, Boulder;University of Colorado, Boulder;Harvard University, Cambridge;University of Colorado, Boulder
Venue:
IEEE Transactions on Dependable and Secure Computing
Year:
2009

Cited by (10):

Variant-based competitive parallel execution of sequential programs
Proceedings of the 7th ACM international conference on Computing frontiers

Using hardware vulnerability factors to enhance AVF analysis
Proceedings of the 37th annual international symposium on Computer architecture

An efficient, dynamically adaptive method to tolerate transient faults in multi-core systems
EWDC '11 Proceedings of the 13th European Workshop on Dependable Computing

Automated application of fault tolerance mechanisms in a component-based system
Proceedings of the 9th International Workshop on Java Technologies for Real-Time and Embedded Systems

Detection and correction of silent data corruption for large-scale high-performance computing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

A work-stealing scheduling framework supporting fault tolerance
Proceedings of the Conference on Design, Automation and Test in Europe

Efficient software-based fault tolerance approach on multicore platforms
Proceedings of the Conference on Design, Automation and Test in Europe

Multiverse: efficiently supporting distributed high-level speculation
Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications

COLO: COarse-grained LOck-stepping virtual machines for non-stop service
Proceedings of the 4th annual Symposium on Cloud Computing

A dual process redundancy approach to transient fault tolerance for ccNUMA architecture
Neurocomputing

Abstract

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
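
The general idea of process-level redundancy described in the abstract (run redundant copies of a program as ordinary operating-system processes, then compare them) can be illustrated with a minimal Python sketch. This is an illustration only and not the PLR prototype: it compares just the exit status and final standard output of each replica, which is a simplifying assumption, and the names `run_redundant` and `vote` are hypothetical. Three replicas are assumed so that a single disagreeing replica can be outvoted.

```python
import subprocess
import sys
from collections import Counter


def run_redundant(cmd, replicas=3, timeout=10):
    """Start `replicas` copies of `cmd` as separate OS processes.

    The operating system is free to schedule each copy on any core.
    Returns one (returncode, stdout_bytes) tuple per replica.
    """
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
        for _ in range(replicas)
    ]
    results = []
    for proc in procs:
        try:
            out, _ = proc.communicate(timeout=timeout)
            results.append((proc.returncode, out))
        except subprocess.TimeoutExpired:
            # A hung replica is treated as a distinct (faulty) outcome.
            proc.kill()
            proc.communicate()
            results.append(("timeout", b""))
    return results


def vote(results):
    """Majority-vote over replica results.

    Returns (winning_result, fault_detected). Raises RuntimeError when
    no result is shared by a strict majority of replicas.
    """
    counts = Counter(results)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    fault_detected = len(counts) > 1
    return winner, fault_detected


if __name__ == "__main__":
    # Hypothetical workload: any deterministic single-threaded program.
    cmd = [sys.executable, "-c", "print(sum(range(1000)))"]
    (returncode, output), fault_detected = vote(run_redundant(cmd))
    print(returncode, output.strip(), fault_detected)
```

Comparing only externally visible results reflects the software-centric view in the abstract: a fault that never changes what the program produces does not cause a mismatch and is ignored. Using two replicas instead of three would still detect a disagreement but could not decide which replica is correct.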