Characterizing cloud computing hardware reliability

Authors:
Kashi Venkatesh Vishwanath;Nachiappan Nagappan
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 1st ACM Symposium on Cloud Computing
Year:
2010



Abstract

Modern datacenters host hundreds of thousands of servers that coordinate tasks to deliver highly available cloud computing services. Each server consists of multiple hard disks, memory modules, network cards, processors, and other components, each of which, although carefully engineered, can fail. While the probability of seeing any such failure during the lifetime of a server (typically 3-5 years in industry) can be somewhat small, these numbers are magnified across all the devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than the exception. Hardware failure can degrade performance for end users and can result in losses to the business. A sound understanding of the numbers and the causes behind these failures helps improve operational experience: it not only leaves us better equipped to tolerate failures but also helps bring down hardware cost through engineering, directly leading to savings for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis of failure predictors. We hope that the results presented in this paper will motivate further research in this area.