Networked Windows NT System Field Failure Data Analysis

Authors:
Jun Xu;Zbigniew Kalbarczyk;Ravishankar K. Iyer
Venue:
PRDC '99 Proceedings of the 1999 Pacific Rim International Symposium on Dependable Computing
Year:
1999

Citing 0
Cited 23

An empirical study of operating systems errors

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Improving cluster availability using workstation validation

SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Architecture and Dependability of Large-Scale Internet Services

IEEE Internet Computing
An Experimental Study of Security Vulnerabilities Caused by Errors

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Reflections on Industry Trends and Experimental Research in Dependability

IEEE Transactions on Dependable and Secure Computing
Destructive Transaction: Human-Oriented Cluster System Management Mechanism

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Advanced non-distributed operating systems course

ACM SIGCSE Bulletin
Why do internet services fail, and what can be done about it?

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Using fault injection and modeling to evaluate the performability of cluster-based services

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

ACM Transactions on Storage (TOS)
Achieving Self-Healing in Autonomic Software Systems: a Case-Based Reasoning Approach

Proceedings of the 2005 conference on Self-Organization and Autonomic Informatics (I)
Current research and practice in proactive fault management

International Journal of Computers and Applications
Quantifying event correlations for proactive failure management in networked computing systems

Journal of Parallel and Distributed Computing
A realistic evaluation of memory hardware errors and software system susceptibility

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs

Proceedings of the sixth conference on Computer systems
Job failures in high performance computing systems: A large-scale empirical study

Computers & Mathematics with Applications
Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems

HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Evaluation of the device driver availability in dawning4000a

GPC'06 Proceedings of the First international conference on Advances in Grid and Pervasive Computing
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems

International Journal of Adaptive, Resilient and Autonomic Systems
A reliability model for cloud computing for high performance computing applications

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Operating system reliability from the quality of experience viewpoint: an exploratory study

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Reliability model of a system of k nodes with simultaneous failures for high-performance computing applications

International Journal of High Performance Computing Applications

Abstract

This paper presents a measurement-based dependability study of a networked Windows NT system, based on field data collected from NT System Logs on 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to total system downtime (22% and 10%, respectively), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, and (7) maintenance and configuration contribute 24% of system downtime.