Failure Data Analysis of a Large-Scale Heterogeneous Server Environment

  • Authors:
  • Ramendra K. Sahoo; Anand Sivasubramaniam; Mark S. Squillante; Yanyong Zhang

  • Affiliations:
  • IBM Thomas J. Watson Research Center, Yorktown Heights, NY; Pennsylvania State University, University Park, PA; IBM Thomas J. Watson Research Center, Yorktown Heights, NY; Rutgers University, Piscataway, NJ

  • Venue:
  • DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
  • Year:
  • 2004

Abstract

The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns are comprised of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which impact performance but also can be exploited to address such performance issues.