Transparent checkpoints of closed distributed systems in Emulab

Authors:
Anton Burtsev; Prashanth Radhakrishnan; Mike Hibler; Jay Lepreau
Affiliations:
University of Utah, Salt Lake City, UT, USA; NetApp, Bangalore, India; University of Utah, Salt Lake City, UT, USA; University of Utah, Salt Lake City, UT, USA
Venue:
Proceedings of the 4th ACM European conference on Computer systems
Year:
2009

Abstract

Emulab is a testbed for networked and distributed systems experimentation. Two guiding principles of its design are realism and control of experimentation. There is an inherent tension between these goals, however, and in some aspects of the testbed's design, Emulab's implementers favored realism over control. Thus, Emulab provides wide-ranging control over an experiment's environment and initial conditions, but relatively little control over its execution; in particular, it lacks the ability to suspend, preempt, or replay an experiment. We have extended Emulab with a new means of control over experiment execution: the ability to cleanly checkpoint the execution of the set of nodes and networks that make up an experiment. Conventional checkpoint mechanisms can easily degrade the fidelity of experiment results as a consequence of checkpoint downtimes, overheads of background state saving, and unintended distributed checkpoint synchronization effects. In this paper, we demonstrate a checkpointing technique that is transparent with respect to the execution of the system under test, almost completely concealing the underlying checkpoint activity. Building on our checkpoint mechanism, we have implemented two powerful facilities for experiment execution control: the ability to preemptively swap out experiments without losing their run-time state, and the ability to time-travel through the run of a system.
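To make the idea of concealing checkpoint downtime concrete, the following is a minimal sketch of one common approach: a virtual clock that stops advancing while a checkpoint is taken. It is an illustration under stated assumptions, not Emulab's implementation; the names `VirtualClock`, `FakeTime`, `suspend`, `resume`, and `demo` are hypothetical and do not come from the abstract.

```python
class FakeTime:
    """Hypothetical stand-in for a real time source, so the demo is deterministic."""

    def __init__(self, start=0.0):
        self.t = start

    def __call__(self):
        return self.t


class VirtualClock:
    """Illustrative clock that hides checkpoint downtime from the system under test.

    Assumption: the system under test reads time only through time().
    """

    def __init__(self, real_time_source):
        self._now = real_time_source   # callable returning real time in seconds
        self._hidden = 0.0             # total downtime concealed so far
        self._suspended_at = None      # real time at which the checkpoint began

    def suspend(self):
        # Freeze virtual time at the start of a checkpoint.
        if self._suspended_at is None:
            self._suspended_at = self._now()

    def resume(self):
        # Account for the downtime so virtual time continues where it stopped.
        if self._suspended_at is not None:
            self._hidden += self._now() - self._suspended_at
            self._suspended_at = None

    def time(self):
        # While suspended, report the frozen instant; otherwise real time minus hidden downtime.
        real = self._suspended_at if self._suspended_at is not None else self._now()
        return real - self._hidden


def demo():
    real = FakeTime(100.0)
    clock = VirtualClock(real)
    seen = [clock.time()]      # before any checkpoint
    real.t = 105.0
    seen.append(clock.time())
    clock.suspend()            # checkpoint begins
    real.t = 135.0             # 30 seconds of real downtime
    seen.append(clock.time())  # virtual time is frozen
    clock.resume()             # checkpoint ends
    seen.append(clock.time())
    real.t = 140.0
    seen.append(clock.time())  # only 5 more virtual seconds have passed
    return seen
```

In this sketch, 30 seconds of real time pass during the checkpoint, yet the virtual clock shows no gap. A real system would also have to hide other effects the abstract names, such as background state-saving overhead and cross-node synchronization, which this example does not model.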