Studying and using failure data from large-scale internet services

Authors:
David Oppenheimer;David A. Patterson
Affiliations:
University of California at Berkeley, Berkeley, CA;University of California at Berkeley, Berkeley, CA
Venue:
EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Year:
2002

Abstract

Large-scale Internet services are the newest and arguably the most commercially important class of systems requiring 24x7 availability. As a result, very little information has been published about their causes of failure. In an attempt to address this deficiency, we have analyzed detailed failure reports from three large-scale Internet services. Our goals are to (1) identify the major factors contributing to user-visible failures, (2) evaluate the (potential) effectiveness of various techniques for preventing and mitigating service failure, and (3) build a fault model for service-level dependability and recovery benchmarks. Our initial results indicate that operator error and network problems are the leading contributors to user-visible failures, that failures in custom-written front-end software are significant, and that online testing and more thoroughly exposing and handling component failures would reduce failure rates in at least one service.