This paper presents a measurement-based dependability study of a networked Windows NT system, based on field data collected from the NT system logs of 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, and (7) maintenance and configuration contribute to 24% of system downtime.