DieCast: Testing Distributed Systems with an Accurate Scale Model

Authors:
Diwaker Gupta;Kashi Venkatesh Vishwanath;Marvin McNett;Amin Vahdat;Ken Yocum;Alex Snoeren;Geoffrey M. Voelker
Affiliations:
Maginatics, Inc.;Google;Microsoft;University of California, San Diego;University of California, San Diego;University of California, San Diego;University of California, San Diego
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2011

Abstract

Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains a difficult problem. Existing testing techniques, such as simulation or running smaller instances of a service, have limitations in predicting overall service behavior at such scales. Testing large services should ideally be done at the same scale and configuration as the target deployment, which can be technically and economically infeasible. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines across a much smaller number of physical machines in a test harness. We show how to accurately scale CPU, network, and disk to provide the illusion that each VM matches a machine in the original service in terms of both available computing resources and communication behavior. We present the architecture and evaluation of a system we built to support such experimentation and discuss its limitations. We show that for a variety of services (including a commercial high-performance cluster-based file system) and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.
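
The resource scaling described in the abstract can be illustrated with a small arithmetic sketch. The abstract does not spell out the mechanism, so the code below assumes a time-dilation style approach: each VM's clock is taken to run N times slower than wall-clock time when N VMs share one physical machine. The names `MachineSpec`, `real_time_budget`, and `perceived_by_vm`, and all numbers in the example, are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineSpec:
    """Resources of one machine (hypothetical fields for this sketch)."""
    cpu_ghz: float         # CPU speed
    net_mbps: float        # link bandwidth
    net_latency_ms: float  # one-way link latency
    disk_mbps: float       # disk throughput


def real_time_budget(target: MachineSpec, multiplex_factor: int) -> MachineSpec:
    """Wall-clock resources to give each VM so that it perceives `target`.

    Assumption: `multiplex_factor` VMs share one physical machine and each
    VM's clock runs `multiplex_factor` times slower than wall-clock time.
    Rates are divided by the factor; latencies are multiplied by it.
    """
    if multiplex_factor < 1:
        raise ValueError("multiplex_factor must be >= 1")
    n = multiplex_factor
    return MachineSpec(
        cpu_ghz=target.cpu_ghz / n,
        net_mbps=target.net_mbps / n,
        net_latency_ms=target.net_latency_ms * n,
        disk_mbps=target.disk_mbps / n,
    )


def perceived_by_vm(real: MachineSpec, multiplex_factor: int) -> MachineSpec:
    """Inverse mapping: what a dilated VM perceives given real resources."""
    n = multiplex_factor
    return MachineSpec(
        cpu_ghz=real.cpu_ghz * n,
        net_mbps=real.net_mbps * n,
        net_latency_ms=real.net_latency_ms / n,
        disk_mbps=real.disk_mbps * n,
    )


# Example: emulate a 2 GHz machine with a 1 Gbps, 1 ms link and an
# 80 MB/s disk, with 10 VMs multiplexed on each physical host.
target = MachineSpec(cpu_ghz=2.0, net_mbps=1000.0,
                     net_latency_ms=1.0, disk_mbps=80.0)
budget = real_time_budget(target, multiplex_factor=10)
round_trip = perceived_by_vm(budget, multiplex_factor=10)
```

Under these assumptions, every rate a VM receives in wall-clock time shrinks by the multiplexing factor while latencies grow by it, so that, measured against the VM's slowed clock, the machine appears unchanged. Whether each resource can actually be throttled this precisely in a real test harness is a separate question that the sketch does not address.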