Towards fingerpointing in the Emulab dynamic distributed system

Authors:
Michael P. Kasick;Priya Narasimhan;Kevin Atkinson;Jay Lepreau
Affiliations:
Electrical & Computer Engineering Department, Carnegie Mellon University;Electrical & Computer Engineering Department, Carnegie Mellon University;School of Computing, University of Utah;School of Computing, University of Utah
Venue:
WORLDS'06 Proceedings of the 3rd conference on USENIX Workshop on Real, Large Distributed Systems - Volume 3
Year:
2006

Abstract

In the large-scale Emulab distributed system, the many failure reports make skilled operator time a scarce and costly resource, as shown by statistics on failure frequency and root cause. We describe the lessons learned with error reporting in Emulab, along with the design, initial implementation, and results of a new local error-analysis approach that is running in production. Through structured error reporting, association of context with each error-type, and propagation of both error-type and context, our new local analysis locates the most prominent failure at the procedure, script, or session level. Evaluation of this local analysis for a targeted set of common Emulab failures suggests that this approach is generally accurate and will facilitate global fingerpointing, which will aim for reliable suggestions as to the root cause of the failure at the system level.