Modular data centers: how to design them?

  • Authors:
  • Kashi Venkatesh Vishwanath, Albert Greenberg, Daniel A. Reed

  • Affiliation:
  • Microsoft Research, Redmond, WA, USA (all authors)

  • Venue:
  • Proceedings of the 1st ACM Workshop on Large-Scale System and Application Performance
  • Year:
  • 2009

Abstract

There has been recent interest in modularized shipping containers as the building block for data centers. However, there are no published results on the design tradeoffs they offer. In this paper we investigate a model in which such a container is never serviced for hardware faults during its deployment lifetime, say 3 years. Instead, the hardware is over-provisioned at the outset and failures are handled gracefully by software. The motivations for this service-free approach range from simpler accounting and management to the increased design flexibility that a sealed container affords. We present a preliminary model of performance, reliability, and cost for such service-free containerized solutions. There are many possible design choices and policies for over-provisioning a container; for instance, as a function of the number of dead servers and the incoming workload, we could decide which servers to selectively turn on or off while still maintaining a desired level of performance. Although evaluating every such policy is challenging, we show that computing the best-case and worst-case designs is tractable. We further show that the projected lifetimes of these two extremes are very close to each other, often differing by no more than 10%. One way to interpret this result is that, from a reliability perspective, keeping machines as cold spares within the container in anticipation of server failures yields little benefit over starting out with all machines active. Consequently, as we engineer containers in sophisticated ways for cost and performance, we can obtain the associated reliability estimates with a simpler, more tractable approximation. We show that these bounds are robust to general distributions of server failure times. We hope that this paper spurs further research into these next-generation data center building blocks, both by improving the models and by corroborating them with field data.
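
The contrast between the two extreme provisioning policies described above, keeping unused machines as cold spares versus powering everything on from day one, can be illustrated with a small Monte Carlo sketch. The code below is not the paper's model: the server counts, the capacity target, the 5-year mean time to failure, and the exponential failure-time assumption are all hypothetical, chosen only to show how the two lifetime estimates can be compared.

```python
import random

def lifetime_all_active(n_total, n_required, mttf_years, rng):
    """Policy A: all n_total servers are powered on from day one.
    The container's useful life ends once fewer than n_required
    servers are still alive."""
    failure_times = sorted(rng.expovariate(1.0 / mttf_years)
                           for _ in range(n_total))
    # The (n_total - n_required + 1)-th failure drops capacity below target.
    return failure_times[n_total - n_required]

def lifetime_cold_spares(n_total, n_required, mttf_years, rng):
    """Policy B: only n_required servers run; the rest sit as cold spares
    (assumed not to fail while powered off) and replace failed servers
    instantly."""
    spares = n_total - n_required
    t = 0.0
    while True:
        # Next failure among the n_required active servers: the minimum of
        # n_required independent exponentials, i.e. rate n_required / MTTF.
        t += rng.expovariate(n_required / mttf_years)
        if spares == 0:
            return t          # no spare left to cover this failure
        spares -= 1           # swap in a cold spare, stay at n_required

if __name__ == "__main__":
    # Hypothetical container: 2,500 servers, 2,000 required, 5-year MTTF.
    N_TOTAL, N_REQUIRED, MTTF_YEARS, TRIALS = 2500, 2000, 5.0, 2000
    rng = random.Random(0)
    a = sum(lifetime_all_active(N_TOTAL, N_REQUIRED, MTTF_YEARS, rng)
            for _ in range(TRIALS)) / TRIALS
    b = sum(lifetime_cold_spares(N_TOTAL, N_REQUIRED, MTTF_YEARS, rng)
            for _ in range(TRIALS)) / TRIALS
    print(f"all servers active : {a:.2f} years")
    print(f"cold spares        : {b:.2f} years")
    print(f"relative gap       : {abs(b - a) / a:.1%}")
```

With these particular parameters the projected lifetimes of the two policies differ by roughly 10%, which is the flavor of the bound reported in the abstract; the exact gap depends on the over-provisioning ratio, the capacity target, and the failure-time distribution.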