Transparent checkpoints of closed distributed systems in Emulab

  • Authors: Anton Burtsev; Prashanth Radhakrishnan; Mike Hibler; Jay Lepreau

  • Affiliations: University of Utah, Salt Lake City, UT, USA; NetApp, Bangalore, India; University of Utah, Salt Lake City, UT, USA; University of Utah, Salt Lake City, UT, USA

  • Venue: Proceedings of the 4th ACM European Conference on Computer Systems

  • Year: 2009

Abstract

Emulab is a testbed for networked and distributed systems experimentation. Two guiding principles of its design are realism and control of experimentation. There is an inherent tension between these goals, however, and in some aspects of the testbed's design, Emulab's implementers favored realism over control. Thus, Emulab provides wide-ranging control over an experiment's environment and initial conditions, but relatively little control over its execution: in particular, it lacks the ability to suspend, preempt, or replay an experiment. We have extended Emulab with a new means of control over experiment execution: the ability to cleanly checkpoint the execution of the set of nodes and networks that make up an experiment. Conventional checkpoint mechanisms can easily degrade the fidelity of experiment results as a consequence of checkpoint downtimes, overheads of background state saving, and unintended distributed checkpoint synchronization effects. In this paper we demonstrate a checkpointing technique that is transparent with respect to the execution of the system under test, almost completely concealing the underlying checkpoint activity. Building on our checkpoint mechanism, we have implemented two powerful facilities for experiment execution control: the ability to preemptively swap out experiments without losing their run-time state, and the ability to time-travel through the run of a system.