Towards fingerpointing in the Emulab dynamic distributed system

  • Authors:
  • Michael P. Kasick;Priya Narasimhan;Kevin Atkinson;Jay Lepreau

  • Affiliations:
  • Electrical & Computer Engineering Department, Carnegie Mellon University;Electrical & Computer Engineering Department, Carnegie Mellon University;School of Computing, University of Utah;School of Computing, University of Utah

  • Venue:
  • WORLDS'06 Proceedings of the 3rd conference on USENIX Workshop on Real, Large Distributed Systems - Volume 3
  • Year:
  • 2006

Abstract

In the large-scale Emulab distributed system, the many failure reports make skilled operator time a scarce and costly resource, as shown by statistics on failure frequency and root cause. We describe the lessons learned from error reporting in Emulab, along with the design, initial implementation, and results of a new local error-analysis approach that is running in production. Through structured error reporting, association of context with each error type, and propagation of both error type and context, our new local analysis locates the most prominent failure at the procedure, script, or session level. Evaluation of this local analysis for a targeted set of common Emulab failures suggests that this approach is generally accurate and will facilitate global fingerpointing, which will aim for reliable suggestions as to the root cause of the failure at the system level.
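
As a rough illustration of the three ingredients the abstract names (structured error reports, context attached to each error type, and propagation of both to an enclosing level), the following minimal Python sketch may help. It is not Emulab's implementation: the `ErrorReport` class, the `propagate` and `most_prominent` helpers, the severity table, and the error-type names are all hypothetical, and selecting the "most prominent" failure by highest severity is an assumption made only for this example.

```python
from dataclasses import dataclass, field

# Hypothetical severity ranking; the abstract does not specify one.
SEVERITY = {"warning": 1, "error": 2, "fatal": 3}


@dataclass
class ErrorReport:
    error_type: str                              # structured error type (hypothetical names)
    severity: str                                # key into SEVERITY (assumed)
    context: dict = field(default_factory=dict)  # context associated with the error type
    level: str = "procedure"                     # procedure, script, or session


def propagate(report, level, **extra_context):
    """Return a copy of the report carried up to an enclosing level.

    Both the error type and its context travel together; context added at
    the enclosing level is merged into the existing context.
    """
    merged = {**report.context, **extra_context}
    return ErrorReport(report.error_type, report.severity, merged, level)


def most_prominent(reports):
    """Pick the highest-severity report; ties go to the earliest report."""
    if not reports:
        return None
    return max(reports, key=lambda r: SEVERITY[r.severity])


# Two procedure-level reports are propagated to the script level,
# where the local analysis selects one of them.
low = ErrorReport("disk_nearly_full", "warning", {"node": "pc1"})
high = ErrorReport("node_boot_failure", "fatal", {"node": "pc2"})
at_script = [propagate(r, "script", script="setup") for r in (low, high)]
chosen = most_prominent(at_script)
```

Keeping the context attached to the report as it moves upward means the analysis at the script or session level can still see where the failure originated, which is the property the abstract attributes to propagating error type and context together.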