Modern datacenters host hundreds of thousands of servers that coordinate tasks to deliver highly available cloud computing services. Each server contains multiple hard disks, memory modules, network cards, processors, and other components, each of which, while carefully engineered, is capable of failing. While the probability of seeing any such failure in the lifetime of a server (typically 3-5 years in industry) can be somewhat small, these numbers are magnified across all devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than the exception. Hardware failure can degrade performance for end users and can result in losses to the business. A sound understanding of the numbers as well as the causes behind these failures helps improve operational experience, not only by leaving us better equipped to tolerate failures but also by bringing down hardware cost through engineering, which directly yields savings for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis of failure predictors. We hope that the results presented in this paper will motivate further research in this area.
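
The scaling argument above (a small per-device failure probability becoming a near-certainty across a datacenter) can be made concrete with a short calculation. The following is only an illustrative sketch: the function names, the fleet size, and the 5% failure probability are hypothetical values chosen for the example, not figures from the paper, and the calculation assumes that devices fail independently, which real hardware may not.

```python
def expected_failures(num_devices: int, p_fail: float) -> float:
    """Expected number of failed devices, given a per-device failure
    probability p_fail over some fixed period (hypothetical input)."""
    return num_devices * p_fail


def prob_at_least_one_failure(num_devices: int, p_fail: float) -> float:
    """Probability that at least one device fails in the period.

    Assumes failures are independent across devices, which is a
    simplification made for this sketch.
    """
    return 1.0 - (1.0 - p_fail) ** num_devices


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    p_fail = 0.05  # assumed per-server failure probability in the period
    for fleet_size in (1, 100, 100_000):
        print(
            fleet_size,
            expected_failures(fleet_size, p_fail),
            prob_at_least_one_failure(fleet_size, p_fail),
        )
```

Under these assumed inputs, a failure that is unlikely for any single server becomes practically certain somewhere in a fleet of a hundred thousand, which is the sense in which failure is the norm at scale.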