Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer PCs

  • Authors:
  • Edmund B. Nightingale;John R. Douceur;Vince Orgovan

  • Affiliations:
  • Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA

  • Venue:
  • Proceedings of the sixth conference on Computer systems
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.