In the research reported in this paper, transient faults were injected, by means of software fault injection, into the nodes and the communication subsystem of a commercial parallel machine running several real applications. The results showed that a significant percentage of faults caused the system to produce wrong results while the application appeared to terminate normally. This demonstrates that fault tolerance techniques are required in parallel systems, not only to ensure that long-running applications can terminate but also, and more importantly, that the results they produce are correct. Of the techniques tested to reduce the percentage of undetected wrong results, only algorithm-based fault tolerance (ABFT) proved effective. For other simple error detection methods to be effective, they have to be designed in rather than added as an afterthought. Fault injection in the communication subsystem demonstrated the effectiveness of end-to-end CRCs on data movements between processors.
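As a rough illustration of the two detection ideas named in the abstract, the sketch below shows a checksum-based ABFT check on a matrix product and an end-to-end CRC check on a message, each exercised with a simulated single-bit transient fault. It is not the paper's implementation: every function name here (matmul, with_column_checksum, abft_check, flip_bit, send, receive_ok) is hypothetical, the matrices are small integer lists so that the checksum comparison is exact, and the CRC is Python's standard zlib.crc32 rather than whatever code the studied machine used.

```python
import zlib


def matmul(a, b):
    """Plain product of two integer matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def abft_check(c_full):
    """True if the last row equals the column sums of the rows above it.

    If a carries a column-checksum row, the product with b carries one
    too, so a mismatch here signals a corrupted result element.
    """
    data, checksum = c_full[:-1], c_full[-1]
    return [sum(col) for col in zip(*data)] == checksum


def flip_bit(payload, bit_index):
    """Return a copy of payload with one bit inverted (simulated fault)."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def send(payload):
    """Sender side: attach a CRC computed over the application data."""
    return payload, zlib.crc32(payload)


def receive_ok(payload, crc):
    """Receiver side: recompute the CRC and compare end to end."""
    return zlib.crc32(payload) == crc


if __name__ == "__main__":
    # ABFT: multiply with a checksum row, then corrupt one result element.
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(with_column_checksum(a), b)
    print("ABFT check, fault-free:", abft_check(c))
    c[0][1] ^= 1 << 4  # hypothetical bit flip in a computed value
    print("ABFT check, after fault:", abft_check(c))

    # End-to-end CRC: corrupt the message between sender and receiver.
    msg, crc = send(b"partial result from node 3")
    print("CRC check, fault-free:", receive_ok(msg, crc))
    print("CRC check, after fault:", receive_ok(flip_bit(msg, 5), crc))
```

In both cases the check is computed over the application's own data rather than inside a lower layer, which is one way to read the abstract's point about detection being designed in. With floating-point matrices the exact comparison in abft_check would need a tolerance, which this sketch leaves out.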