Why Do Internet Services Fail, and What Can Be Done About It?
Large-scale Internet services are the newest and arguably the most commercially important class of systems requiring 24x7 availability. As a result, very little information has been published about their causes of failure. In an attempt to address this deficiency, we have analyzed detailed failure reports from three large-scale Internet services. Our goals are to (1) identify the major factors contributing to user-visible failures, (2) evaluate the (potential) effectiveness of various techniques for preventing and mitigating service failure, and (3) build a fault model for service-level dependability and recovery benchmarks. Our initial results indicate that operator error and network problems are the leading contributors to user-visible failures, that failures in custom-written front-end software are significant, and that online testing and more thoroughly exposing and handling component failures would reduce failure rates in at least one service.