A measurement-based model for workload dependence of CPU errors

  • Authors:
  • Ravishankar K. Iyer; David J. Rossetti

  • Affiliations:
  • Univ. of Illinois, Urbana, IL; Metaphor Computer Systems, Mountain View, CA

  • Venue:
  • IEEE Transactions on Computers - The MIT Press scientific computation series
  • Year:
  • 1986

Abstract

This paper proposes and validates a methodology for explicitly measuring the increase in the risk of a processor error with increasing workload. By relating the occurrence of a CPU-related error to the system activity just prior to the error, the approach measures the dynamic CPU workload/failure relationship. The measurements show that the probability of a CPU-related error (the load hazard) increases nonlinearly with increasing workload; i.e., the CPU rapidly deteriorates as end points are reached. The load hazard is observed to be most sensitive to system CPU utilization, the I/O rate, and the interrupt rates. The results are significant because they indicate that it may not be useful to push a system close to its performance limits (the previously accepted operating goal), since what we gain in slightly improved performance is more than offset by the degradation in reliability. Importantly, they also indicate that conventional reliability models need to be reevaluated so as to take system workload explicitly into account.
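
As a rough illustration of the idea of a load hazard, the sketch below estimates the probability of an error conditional on the workload level observed just before it. It is not the estimator used in the paper: the function name `load_hazard`, the equal-width binning, the normalization of workload to the range [0, 1], and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def load_hazard(samples, n_bins=5):
    """Estimate P(error | workload bin) from observation intervals.

    samples: iterable of (workload, had_error) pairs, where workload is a
             hypothetical normalized measure in [0, 1] (e.g. CPU utilization
             sampled just before the interval) and had_error is a bool.
    Returns a dict mapping bin index -> fraction of intervals with an error.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for workload, had_error in samples:
        # Clamp so that workload == 1.0 falls into the last bin.
        b = min(int(workload * n_bins), n_bins - 1)
        totals[b] += 1
        if had_error:
            errors[b] += 1
    return {b: errors[b] / totals[b] for b in sorted(totals)}


# Made-up observations: (workload before the interval, error occurred?)
samples = [
    (0.10, False), (0.15, False), (0.10, False),
    (0.50, False), (0.55, True),
    (0.90, True), (0.95, True), (0.92, False),
]
hazard = load_hazard(samples, n_bins=5)
for b, p in hazard.items():
    print(f"bin {b}: estimated hazard {p:.2f}")
```

With real measurements, a hazard that rises faster than linearly across the bins would correspond to the nonlinear increase the abstract describes; the same binning could be repeated for other workload variables such as the I/O rate or the interrupt rate.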