Measurement and modeling of computer reliability as affected by system activity

  • Authors:
  • R. K. Iyer; D. J. Rossetti; M. C. Hsueh

  • Affiliations:
  • Univ. of Illinois at Urbana-Champaign, Urbana; Stanford Univ., Stanford, CA; Univ. of Illinois at Urbana-Champaign, Urbana

  • Venue:
  • ACM Transactions on Computer Systems (TOCS)

  • Year:
  • 1986

Abstract

This paper demonstrates a practical approach to the study of the failure behavior of computer systems. Particular attention is devoted to the analysis of permanent failures. A number of important techniques, which may have general applicability in both failure and workload analysis, are brought together in this presentation. These include: smeared averaging of the workload data, clustering of like failures, and joint analysis of workload and failures. Approximately 17 percent of all failures affecting the CPU were estimated to be permanent. The manifestation of a permanent failure was found to be strongly correlated with the level and type of workload prior to the failure. Although, in strict terms, the results only relate to the manifestation of permanent failures and not to their occurrence, there are strong indications that permanent failures are both caused and discovered by increased activity. More measurements and experiments are necessary to determine their respective contributions to the measured workload/failure relationship.