Studying and using failure data from large-scale internet services

  • Authors:
  • David Oppenheimer;David A. Patterson

  • Affiliations:
  • University of California at Berkeley, Berkeley, CA;University of California at Berkeley, Berkeley, CA

  • Venue:
  • EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Large-scale Internet services are the newest and arguably the most commercially important class of systems requiring 24x7 availability. As a result, very little information has been published about their causes of failure. In an attempt to address this deficiency, we have analyzed detailed failure reports from three large-scale Internet services. Our goals are to (1) identify the major factors contributing to user-visible failures, (2) evaluate the (potential) effectiveness of various techniques for preventing and mitigating service failure, and (3) build a fault model for service-level dependability and recovery benchmarks. Our initial results indicate that operator error and network problems are the leading contributors to user-visible failures, that failures in custom-written front-end software are significant, and that online testing and more thoroughly exposing and handling component failures would reduce failure rates in at least one service.