This paper presents a dependability study of high-speed, switched Local Area Networks (LANs), using Myrinet (with a theoretical speed of 2.56 Gbps) as an example testbed. The study uses the results of two fault injection methods, simulated fault injection and software-implemented fault injection (SWIFI), to analyze the application-level impact of transient faults injected into the network interface hardware. The observed failures include dropped or corrupted messages, host interface or host resets, and local or remote host interface hangs. The paper presents the study in two parts. First, the results from the SWIFI method on the real system are used as a basis to validate the simulation and to identify the major factors leading to differences between the methods. A comparison shows that the two injection methods agree for 83 percent of the fault injections. The results, however, vary greatly depending on the fault type considered. Second, the study analyzes the effects of varying the workload intensity, the host platform, and the interface function targeted by the injection; for example, the function targeted has a significant impact on the fault activation rate. Finally, the study identifies two mechanisms by which faults may propagate from the interface to other parts of the network: in one example, this propagation caused the interface's host computer to reboot, while in another it caused a remote interface in the network to hang.