More intervention now!

Authors:
Moises Goldszmidt;Rebecca Isaacs
Affiliations:
Microsoft Research;Microsoft Research
Venue:
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Year:
2011

Abstract

Techniques for characterizing performance and diagnosing problems typically endeavor to minimize the perturbation caused by measurement and data collection. We call for exactly the opposite. To characterize the behavior of a system, perform root-cause analysis, and answer what-if questions, we need to conduct active and systematic experiments on our systems, perhaps while those systems are running. We argue that in distributed computing frameworks such as MapReduce, Dryad and Hadoop, the conditions are right for conducting these experiments automatically. At each stage a large number of nodes perform the same computation, which provides a sound statistical population. Furthermore, such systems already have the infrastructure to isolate and recreate the conditions of a run. In this paper we propose the missing piece: a blueprint of the causal interactions that can be used to plan these experiments and to draw inferences from their results. Machine learning and statistical analysis give us the tools and algorithms for inducing such a causal blueprint from a combination of passive observations and active interventions.
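
The idea of treating the nodes of one stage as a statistical population for an active experiment can be illustrated with a minimal sketch. It is not the authors' method: the function names, the simulated task-time model, and the "throttling" intervention are all hypothetical, and the data are synthetic. The sketch randomly assigns half of the nodes to an intervention and uses a permutation test to judge whether the observed difference in task time is likely to be an effect of the intervention.

```python
import random
from statistics import fmean


def assign_groups(node_ids, fraction, rng):
    """Randomly choose a fraction of the nodes to receive the intervention."""
    shuffled = list(node_ids)
    rng.shuffle(shuffled)
    k = int(len(shuffled) * fraction)
    return set(shuffled[:k]), set(shuffled[k:])


def permutation_test(treated, control, rng, rounds=2000):
    """Two-sided permutation test on the difference of group means."""
    observed = fmean(treated) - fmean(control)
    pooled = list(treated) + list(control)
    n = len(treated)
    extreme = 0
    for _ in range(rounds):
        # Relabel nodes at random to see how large a difference chance alone gives.
        rng.shuffle(pooled)
        diff = fmean(pooled[:n]) - fmean(pooled[n:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (rounds + 1)


def simulated_task_time(throttled, rng):
    """Hypothetical model: about 10 s per task, plus 3 s when the node is throttled."""
    return rng.gauss(10.0, 1.0) + (3.0 if throttled else 0.0)


rng = random.Random(0)  # fixed seed so the run is repeatable
nodes = range(200)      # 200 nodes running the same stage (assumed)
treated_ids, control_ids = assign_groups(nodes, 0.5, rng)

treated = [simulated_task_time(True, rng) for _ in treated_ids]
control = [simulated_task_time(False, rng) for _ in control_ids]

effect, p_value = permutation_test(treated, control, rng)
print(f"estimated effect: {effect:.2f} s, p-value: {p_value:.4f}")
```

Because the assignment is random, a hidden factor that influences task time is equally likely to fall in either group, which is what separates this kind of intervention from inference on passive observations alone.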