Failure Data Analysis of a LAN of Windows NT Based Computers

Authors:
M. Kalyanakrishnam; Z. Kalbarczyk; R. Iyer
Venue:
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Year:
1999

Abstract

This paper presents the results of a failure data analysis of a LAN of Windows NT machines. Data for the study were obtained from event logs collected over a six-month period from the mail routing network of a commercial organization. The study focuses on characterizing the causes of machine reboots. The key observations are: (1) most of the problems that lead to reboots are software related; (2) rebooting a machine does not always solve the problem (in about 60% of the reboots, the rebooted machine reported problems within an hour or two of the reboot); (3) there are indications of propagated or correlated failures; and (4) although the average availability is over 99%, a machine outage lasts two hours on average. Because the machines are dedicated mail servers, bringing down one or more of them can disrupt the storage, forwarding, reception, and delivery of mail. This suggests that average availability alone is not a good measure for characterizing this type of network service.
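
The tension between a high average availability and long outages can be illustrated with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is not the paper's method or data: the function names are hypothetical, and the 99% and two-hour figures are rounded, illustrative inputs taken only loosely from the abstract. It shows how far apart failures would have to be for two-hour outages to still yield 99% availability.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def mttf_for_availability(target: float, mttr_hours: float) -> float:
    """Mean uptime between outages needed to reach `target` availability.

    Obtained by solving target = MTTF / (MTTF + MTTR) for MTTF.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target availability must be strictly between 0 and 1")
    return target * mttr_hours / (1.0 - target)


if __name__ == "__main__":
    # Illustrative inputs only (assumed, rounded): 99% availability, 2-hour outages.
    mttr = 2.0
    mttf = mttf_for_availability(0.99, mttr)
    print(f"uptime between outages: about {mttf:.0f} hours")
    print(f"availability check: {availability(mttf, mttr):.4f}")
```

Under these assumed inputs, a machine would fail roughly once every eight days and be unavailable for two hours each time. The single availability figure hides both the outage length and the outage frequency, which is what matters for a mail server.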