There has been recent interest in modularized shipping containers as the building block for data centers. However, there are no published results on the design tradeoffs they offer. In this paper we investigate a model in which such a container is never serviced for hardware faults during its deployment lifetime, say 3 years. Instead, the hardware is over-provisioned at the outset and failures are handled gracefully by software. The reasons for this approach range from ease of accounting and management to the increased design flexibility that a sealed, service-free container allows. We present a preliminary model of performance, reliability and cost for such service-less containerized solutions. There are a number of design choices and policies for over-provisioning the containers. For instance, as a function of the number of dead servers and the incoming workload, we could decide which servers to selectively turn on or off while still maintaining a desired level of performance. While evaluating each such choice is challenging, we demonstrate that arriving at the best-case and worst-case designs is tractable. We further demonstrate that the projected lifetimes of these extreme cases are very close to each other, often differing by no more than 10%. One way to interpret this number is that, from a reliability perspective, the utility of keeping machines as cold spares within the container in anticipation of server failures is not very different from that of starting out with all machines active. So as we engineer the containers in sophisticated ways for cost and performance, we can arrive at the associated reliability estimates using a simpler, more tractable approximation. We demonstrate that these bounds are robust to general distributions of server failure times. We hope that this paper stirs up a number of research investigations geared towards understanding these next-generation data center building blocks, both by improving the models and by corroborating them with field data.
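To make the cold-spare versus all-active comparison concrete, the following is a minimal sketch under strong simplifying assumptions that are not taken from the paper: server lifetimes are independent and exponentially distributed with a common mean, powered-off spares never fail, and the container is considered dead once fewer than `k` of its `n` servers are alive. The function names and parameters are hypothetical and chosen only for illustration.

```python
def expected_lifetime_all_active(n: int, k: int, mttf: float) -> float:
    """Expected container lifetime when all n servers run from the start.

    Assumes i.i.d. exponential server lifetimes with mean `mttf`.
    With i servers alive, the next failure arrives after mttf / i on
    average; the container dies at the failure that leaves k - 1 alive.
    """
    return sum(mttf / i for i in range(k, n + 1))


def expected_lifetime_cold_spares(n: int, k: int, mttf: float) -> float:
    """Expected container lifetime when exactly k servers run at a time.

    Assumes the n - k powered-off spares do not fail and replace failed
    servers instantly, so failures always arrive at rate k / mttf until
    the (n - k + 1)-th failure leaves fewer than k servers.
    """
    return (n - k + 1) * mttf / k


def relative_gap(n: int, k: int, mttf: float = 1.0) -> float:
    """Fractional difference between the two expected lifetimes."""
    cold = expected_lifetime_cold_spares(n, k, mttf)
    hot = expected_lifetime_all_active(n, k, mttf)
    return (cold - hot) / cold


if __name__ == "__main__":
    # Illustrative numbers only: 110 servers, 100 needed for the workload.
    print(f"relative gap: {relative_gap(110, 100):.3f}")
```

For these illustrative numbers the two expected lifetimes differ by less than 10%. That only shows how such a gap can stay small when the over-provisioning margin is modest, and it says nothing about the paper's actual model or its results for general failure distributions.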