Modular data centers: how to design them?

Authors:
Kashi Venkatesh Vishwanath;Albert Greenberg;Daniel A. Reed
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 1st ACM workshop on Large-Scale system and application performance
Year:
2009

Abstract

There has been recent interest in modularized shipping containers as the building block for data centers. However, there are no published results on the design tradeoffs they offer. In this paper we investigate a model in which such a container is never serviced for hardware faults during its deployment lifetime, say 3 years. Instead, the hardware is over-provisioned at the outset and failures are handled gracefully by software. The motivations range from ease of accounting and management to the increased design flexibility that a sealed, service-free container allows. We present a preliminary model of performance, reliability, and cost for such service-less containerized solutions. There are a number of design choices and policies for over-provisioning the containers. For instance, as a function of the number of dead servers and the incoming workload, we could decide which servers to selectively turn on or off while still maintaining a desired level of performance. While evaluating each such choice is challenging, we demonstrate that arriving at the best- and worst-case designs is tractable. We further demonstrate that the projected lifetimes of these extreme cases are very close to each other, often differing by no more than 10%. One way to interpret this number is that, from a reliability perspective, the utility of keeping machines as cold spares within the container, in anticipation of server failures, is not too different from that of starting out with all machines active. So as we engineer the containers in sophisticated ways for cost and performance, we can arrive at the associated reliability estimates using a simpler, more tractable approximation. We demonstrate that these bounds are robust to general distributions of server failure times. We hope that this paper stirs up a number of research investigations geared towards understanding these next-generation data center building blocks. This involves both improving the models and corroborating them with field data.
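The comparison between cold spares and an all-active start can be illustrated with a small Monte Carlo sketch. This is not the paper's model, which the abstract does not specify. It assumes a container of `n` servers that is considered dead once fewer than `k` are working, independent server failure times drawn from a caller-supplied distribution, and cold spares that do not age while powered off. All function names and parameters here are hypothetical.

```python
import heapq
import random


def lifetime_all_active(n, k, draw, rng):
    """All n servers run from t=0; the container dies at the (n-k+1)-th failure."""
    failure_times = sorted(draw(rng) for _ in range(n))
    return failure_times[n - k]


def lifetime_cold_spares(n, k, draw, rng):
    """k servers run; n-k spares stay off (assumed not to age) until needed."""
    # Min-heap of absolute failure times of the k currently active servers.
    active = [draw(rng) for _ in range(k)]
    heapq.heapify(active)
    spares = n - k
    while True:
        t = heapq.heappop(active)  # next failure among active servers
        if spares == 0:
            return t  # no replacement left: fewer than k servers remain
        spares -= 1
        # Power on a spare at time t; its lifetime clock starts now.
        heapq.heappush(active, t + draw(rng))


def mean_lifetime(policy, n, k, draw, trials=2000, seed=0):
    """Average container lifetime over independent simulated deployments."""
    rng = random.Random(seed)
    return sum(policy(n, k, draw, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n, k = 100, 90   # illustrative sizes, not taken from the paper
    mttf = 1.0       # server mean time to failure, arbitrary time unit

    def exponential(rng):
        return rng.expovariate(1.0 / mttf)

    def weibull(rng):
        # Shape 0.7 is an arbitrary choice; the mean is not matched to mttf.
        return rng.weibullvariate(mttf, 0.7)

    for name, draw in [("exponential", exponential), ("weibull", weibull)]:
        hot = mean_lifetime(lifetime_all_active, n, k, draw)
        cold = mean_lifetime(lifetime_cold_spares, n, k, draw)
        print(f"{name}: all-active={hot:.4f} cold-spares={cold:.4f}")
```

Under the exponential assumption the two means also have closed forms: `mttf * sum(1/i for i in range(k, n + 1))` for the all-active start and `mttf * (n - k + 1) / k` for cold spares. For `n = 100` and `k = 90` these differ by roughly 5%, which is in the same spirit as the abstract's observation, though it is a property of this toy setup rather than a result from the paper.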