Intel Corporation developed the Teraflops supercomputer for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). This is the most powerful computing machine available today, performing over two trillion floating-point operations per second with the aid of more than 9,000 Intel processors. The Teraflops machine employs complex hardware and software fault/error handling mechanisms to comply with DOE's reliability requirements. This paper briefly describes the system architecture and presents the validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was used for validation. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault/error handling mechanisms. The dependency of fault/error detection coverage on fault duration was also determined. Fault injection experiments revealed several malfunctions at the hardware, firmware, and software levels. After corrective actions were implemented, the supercomputer performed according to the DOE requirements. The fault injection approach presented in this paper can be used to validate any fault-tolerant or highly available computing system.
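As an illustration of how detection coverage might be related to fault duration, the following minimal sketch estimates coverage as the fraction of injected faults that were detected, with a Wilson score confidence interval, for each fault duration. This is not the estimator used in the paper, which the abstract does not specify. The function names, the 95% confidence level, and the idea of grouping counts by duration in nanoseconds are all assumptions made for illustration, and any counts passed in would be hypothetical.

```python
import math


def coverage_estimate(detected, injected, z=1.96):
    """Return (point estimate, lower bound, upper bound) for detection coverage.

    Uses the Wilson score interval for a binomial proportion.
    z=1.96 corresponds to an approximate 95% confidence level (an assumption).
    """
    if injected <= 0:
        raise ValueError("injected must be positive")
    if not 0 <= detected <= injected:
        raise ValueError("detected must lie between 0 and injected")

    p = detected / injected
    denom = 1 + z * z / injected
    centre = (p + z * z / (2 * injected)) / denom
    half = z * math.sqrt(
        p * (1 - p) / injected + z * z / (4 * injected * injected)
    ) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


def coverage_by_duration(experiments):
    """Map each fault duration to its coverage estimate.

    `experiments` is a hypothetical dict: duration_ns -> (detected, injected).
    """
    return {
        duration: coverage_estimate(detected, injected)
        for duration, (detected, injected) in sorted(experiments.items())
    }
```

Calling `coverage_by_duration` with one `(detected, injected)` pair per injected fault duration gives a table of estimates whose intervals show whether an apparent change in coverage across durations is larger than the sampling uncertainty. The Wilson interval is chosen here because it stays within [0, 1] and behaves reasonably when coverage is close to 1, which is the usual case for fault-tolerant systems.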