This paper discusses the issues involved in analyzing the availability of networked systems using fault injection and the failure data collected by the logging mechanisms built into the system. In particular, we address (1) analysis in the prototype phase, using physical fault injection into an actual system, and (2) measurement-based analysis of systems in the field. For the first, we use the example of a fault injection-based evaluation of a software-implemented fault tolerance (SIFT) environment, built around a set of self-checking processes called ARMORs, that provides error detection and recovery services to spaceborne scientific applications. For the second, we use the example of a LAN of Windows NT based computers to present methods for collecting and analyzing failure data to characterize network system dependability. Both fault injection and failure data analysis enable us to study naturally occurring errors and to provide feedback to system designers on potential availability bottlenecks. For example, the study of failures in a network of Windows NT machines reveals that most of the problems that lead to reboots are software related and that, although the average availability evaluates to over 99%, a typical machine provides acceptable service only about 92% of the time on average.
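The gap between availability and acceptable service can be illustrated with a minimal sketch. It assumes, as one possible reading not spelled out in the abstract, that a machine can be running yet unable to serve users, so a reboot-based measure counts that time as available while a service-based measure does not. The `Interval` type, the state names, and the log values are all hypothetical. The numbers are chosen only to echo the two percentages and are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: float  # hours since the start of the observation window
    end: float    # hours since the start of the observation window
    state: str    # hypothetical labels: "up", "degraded", or "down"


def fraction_in_states(intervals, states):
    """Return the fraction of observed time spent in any of the given states."""
    total = sum(i.end - i.start for i in intervals)
    if total <= 0:
        raise ValueError("observation window is empty")
    selected = sum(i.end - i.start for i in intervals if i.state in states)
    return selected / total


# Hypothetical 100-hour log for one machine (illustrative values only).
log = [
    Interval(0.0, 90.0, "up"),
    Interval(90.0, 97.0, "degraded"),  # running, but assumed not serving users
    Interval(97.0, 98.0, "down"),      # reboot in progress
    Interval(98.0, 100.0, "up"),
]

# Reboot-based view: any time the machine is running counts as available.
availability = fraction_in_states(log, {"up", "degraded"})

# Service-based view: only fully functional time counts.
acceptable = fraction_in_states(log, {"up"})

print(f"availability:       {availability:.2%}")
print(f"acceptable service: {acceptable:.2%}")
```

With this made-up log the two measures differ by seven percentage points, which shows how a machine can look highly available by reboot counts alone while delivering noticeably less useful service.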