More intervention now!

  • Authors: Moises Goldszmidt; Rebecca Isaacs

  • Affiliations: Microsoft Research; Microsoft Research

  • Venue: HotOS'13: Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems
  • Year: 2011

Abstract

Techniques for characterizing performance and diagnosing problems typically endeavor to minimize the perturbation caused by measurement and data collection. We call for exactly the opposite. To characterize the behavior of a system, perform root-cause analysis, and answer what-if questions, we need to conduct active and systematic experiments on our systems, perhaps while those systems are running. We argue that in distributed computing frameworks such as MapReduce, Dryad and Hadoop, the conditions are right for conducting these experiments automatically. At each stage, a large number of nodes perform the same computation, which provides a sound statistical population. Furthermore, such systems already have the infrastructure to isolate and recreate the conditions of a run. In this paper we propose the missing piece: a blueprint of the causal interactions that can be used to plan these experiments and to draw inferences from their results. Machine learning and statistical analysis give us the tools and algorithms for inducing such a causal blueprint from a combination of passive observations and active intervention.
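
To make the idea of an active experiment concrete, the following is a minimal sketch, not the authors' method. It assumes that the nodes of a single stage can be split at random into a treated group (which receives some deliberate perturbation) and a control group, and that a per-node runtime is recorded afterwards. The function names `assign_groups`, `effect_estimate`, and `permutation_p_value` are hypothetical, and the permutation test is a standard statistical technique chosen here for illustration; the abstract does not name a specific one.

```python
import random
import statistics


def assign_groups(node_ids, treated_fraction=0.5, seed=0):
    """Randomly split the nodes of one stage into (treated, control).

    Randomization is what lets a difference between the groups be read
    as an effect of the intervention rather than of node selection.
    """
    rng = random.Random(seed)
    shuffled = list(node_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * treated_fraction)
    return shuffled[:cut], shuffled[cut:]


def effect_estimate(treated_times, control_times):
    """Difference in mean runtime: treated minus control."""
    return statistics.mean(treated_times) - statistics.mean(control_times)


def permutation_p_value(treated_times, control_times, rounds=2000, seed=0):
    """Two-sided permutation test for the difference in means.

    Repeatedly reshuffles the pooled measurements and counts how often a
    random relabeling yields an effect at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(effect_estimate(treated_times, control_times))
    pooled = list(treated_times) + list(control_times)
    k = len(treated_times)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if abs(effect_estimate(pooled[:k], pooled[k:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)
```

In this sketch, a small p-value suggests that the perturbation applied to the treated nodes changed their runtime. Deciding which perturbations to apply, and how to interpret them jointly, is the role the abstract assigns to the causal blueprint.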