Fay: extensible distributed tracing from kernels to clusters

  • Authors:
  • Úlfar Erlingsson;Marcus Peinado;Simon Peter;Mihai Budiu

  • Affiliations:
  • Google Inc.;Microsoft Research, Extreme Computing Group;ETH Zurich, Systems Group;Microsoft Research, Silicon Valley

  • Venue:
  • SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
  • Year:
  • 2011

Quantified Score

Hi-index 0.02

Visualization

Abstract

Fay is a flexible platform for the efficient collection, processing, and analysis of software execution traces. Fay provides dynamic tracing through use of runtime instrumentation and distributed aggregation within machines and across clusters. At the lowest level, Fay can be safely extended with new tracing primitives, including even untrusted, fully-optimized machine code, and Fay can be applied to running user-mode or kernel-mode software without compromising system stability. At the highest level, Fay provides a unified, declarative means of specifying what events to trace, as well as the aggregation, processing, and analysis of those events. We have implemented the Fay tracing platform for Windows and integrated it with two powerful, expressive systems for distributed programming. Our implementation is easy to use, can be applied to unmodified production systems, and provides primitives that allow the overhead of tracing to be greatly reduced, compared to previous dynamic tracing platforms. To show the generality of Fay tracing, we reimplement, in experiments, a range of tracing strategies and several custom mechanisms from existing tracing frameworks. Fay shows that modern techniques for high-level querying and data-parallel processing of disaggregated data streams are well suited to comprehensive monitoring of software execution in distributed systems. Revisiting a lesson from the late 1960's [15], Fay also demonstrates the efficiency and extensibility benefits of using safe, statically-verified machine code as the basis for low-level execution tracing. Finally, Fay establishes that, by automatically deriving optimized query plans and code for safe extensions, the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.