Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs

Authors:
Edmund B. Nightingale;John R. Douceur;Vince Orgovan
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA
Venue:
Proceedings of the sixth conference on Computer systems
Year:
2011

Abstract

We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware-induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8-month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
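The two CPU figures quoted in the abstract can be compared directly to see how much a first crash raises the chance of another. The following is a minimal sketch of that arithmetic only; the function name `relative_risk` and the variable names are illustrative assumptions, not anything taken from the paper, and the calculation uses just the two quoted probabilities.

```python
from fractions import Fraction

# Figures quoted in the abstract (CPU subsystem, machines with at least
# 30 days of accumulated CPU time over the 8-month period).
p_first = Fraction(1, 190)    # chance of a first CPU-subsystem crash
p_second = Fraction(10, 33)   # "1 in 3.3" chance of a second crash after a first


def relative_risk(p_after: Fraction, p_before: Fraction) -> Fraction:
    """How many times more likely the later event is than the earlier one.

    Hypothetical helper for illustration; not from the paper.
    """
    return p_after / p_before


ratio = relative_risk(p_second, p_first)
print(f"A machine that crashed once is about {float(ratio):.1f}x more likely to crash again")
```

On these two numbers the increase is a factor of roughly 58, which sits below the "up to two orders of magnitude" upper bound stated for hardware failures in general.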