DieCast: testing distributed systems with an accurate scale model

Authors:
Diwaker Gupta;Kashi V. Vishwanath;Amin Vahdat
Affiliations:
University of California, San Diego;University of California, San Diego;University of California, San Diego
Venue:
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Year:
2008

Abstract

Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains a difficult problem. Existing testing techniques, such as simulation or running smaller instances of a service, have limitations in predicting overall service behavior. Although technically and economically infeasible at this time, testing should ideally be performed at the same scale and with the same configuration as the deployed service. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines (VMs) spread across a much smaller number of physical machines in a test harness. CPU, network, and disk are then accurately scaled to provide the illusion that each VM matches a machine from the original service in terms of both available computing resources and communication behavior to remote service nodes. We present the architecture and evaluation of a system to support such experimentation and discuss its limitations. We show that for a variety of services (including a commercial, high-performance, cluster-based file system) and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.
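
The resource scaling described in the abstract can be illustrated with a small arithmetic sketch. This is not the authors' implementation. It assumes a time-dilation style of scaling, which the abstract does not spell out: if `factor` VMs share one physical host and each VM's clock runs `factor` times slower than wall-clock time, then a VM given 1/`factor` of a resource in real time perceives the full resource. The names `NodeSpec`, `real_time_allocation`, `perceived`, and `hosts_needed` are hypothetical, and disk scaling is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Resources of one machine in the original service (hypothetical fields)."""
    cpu_cores: float
    net_mbps: float
    net_latency_ms: float


def real_time_allocation(target: NodeSpec, factor: int) -> NodeSpec:
    """Resources to give one VM in real time when `factor` VMs share a host.

    Assumption: the VM's clock is slowed by `factor`, so rates are divided
    by `factor` and delays are multiplied by `factor`.
    """
    if factor < 1:
        raise ValueError("multiplexing factor must be >= 1")
    return NodeSpec(
        cpu_cores=target.cpu_cores / factor,
        net_mbps=target.net_mbps / factor,
        net_latency_ms=target.net_latency_ms * factor,
    )


def perceived(alloc: NodeSpec, factor: int) -> NodeSpec:
    """What the VM observes on its slowed clock: rates up, delays down."""
    return NodeSpec(
        cpu_cores=alloc.cpu_cores * factor,
        net_mbps=alloc.net_mbps * factor,
        net_latency_ms=alloc.net_latency_ms / factor,
    )


def hosts_needed(service_nodes: int, factor: int) -> int:
    """Physical hosts required to run every service node as a VM."""
    return -(-service_nodes // factor)  # ceiling division
```

For example, under these assumptions, `real_time_allocation(NodeSpec(2.0, 1000.0, 0.5), 10)` gives each VM a tenth of the CPU and bandwidth and ten times the latency in real time, and `perceived` maps that allocation back to the original machine's specification as seen from inside the VM.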