Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures

Authors:
Sriram Sankar;Mark Shaw;Kushagra Vaid;Sudhanva Gurumurthi
Affiliations:
Microsoft Corporation;Microsoft Corporation;Microsoft Corporation;University of Virginia
Venue:
ACM Transactions on Storage (TOS)
Year:
2013

Abstract

With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. When failures increase, a large datacenter facility incurs higher maintenance costs in addition to service unavailability. Among different server components, hard disk drives are known to contribute significantly to server failures; however, there is very little understanding of the major determinants of disk failures in datacenters. In this work, we focus on the interrelationship between temperature, workload, and hard disk drive failures in a large-scale datacenter. We present a dense storage case study from a population housing thousands of servers and tens of thousands of disk drives, hosting a large-scale online service at Microsoft. We specifically establish a correlation between temperatures and failures observed at different location granularities: (a) inside drive locations in a server chassis, (b) across server locations in a rack, and (c) across multiple racks in a datacenter. We show that temperature exhibits a stronger correlation with drive failures than disk utilization does. We establish that variations in temperature are not significant in datacenters and have little impact on failures. We also explore workload impacts on temperature and disk failures and show that the impact of workload is not significant. We then experimentally evaluate knobs that control disk drive temperature, including workload and chassis design knobs. We corroborate our findings from the real data study and show that workload knobs have minimal impact on temperature, whereas chassis knobs such as disk placement and fan speeds have a larger impact. Finally, we show the proposed cost benefit of temperature optimizations that increase hard disk drive reliability.