This paper presents the results of a failure data analysis of a LAN of Windows NT machines. Data for the study was obtained from event logs collected over a six-month period from the mail routing network of a commercial organization. The study focuses on characterizing the causes of machine reboots. The key observations are: (1) most of the problems that lead to reboots are software related; (2) rebooting a machine does not always solve the problem (in about 60% of the reboots, the rebooted machine reported problems within an hour or two of the reboot); (3) there are indications of propagated or correlated failures; and (4) although the average availability exceeds 99%, a machine's downtime lasts two hours on average. Since the machines are dedicated mail servers, bringing down one or more of them can disrupt the storage, forwarding, reception, and delivery of mail. This suggests that average availability alone is not a good measure for characterizing this type of network service.
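The gap between a high average availability and a long individual outage can be illustrated with a small calculation. The sketch below is not the paper's method or data: the 180-day window and the count of ten outages are hypothetical values chosen only for illustration, and only the two-hour outage length echoes the figure reported above. Availability is taken here as the fraction of the observation window during which the machine is up.

```python
from typing import Sequence


def availability(window_hours: float, outage_hours: Sequence[float]) -> float:
    """Fraction of the observation window during which the machine was up."""
    downtime = sum(outage_hours)
    return (window_hours - downtime) / window_hours


# Hypothetical inputs (assumptions, not data from the study):
# a 180-day observation window and ten outages of two hours each.
WINDOW_HOURS = 180 * 24
outages = [2.0] * 10

avail = availability(WINDOW_HOURS, outages)
mean_outage = sum(outages) / len(outages)

print(f"availability = {avail:.4%}")
print(f"mean outage  = {mean_outage:.1f} h")
```

Under these assumed numbers the availability stays above 99% even though every outage takes a mail server offline for two hours, which is why a single average figure can hide service disruptions of this length.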