Techniques for characterizing performance and diagnosing problems typically try to minimize the perturbation caused by measurement and data collection. We call for exactly the opposite. To characterize the behavior of a system, perform root-cause analysis, and answer what-if questions, we need to conduct active, systematic experiments on our systems, perhaps while those systems are running. We argue that in distributed computing frameworks such as MapReduce, Dryad, and Hadoop, the conditions are right for conducting these experiments automatically. At each stage, a large number of nodes perform the same computation and thus provide a sound statistical population. Furthermore, these systems already have the infrastructure to isolate and recreate the conditions of a run. In this paper we propose the missing piece: a blueprint of the causal interactions that can be used to plan these experiments and to draw inferences from their results. Machine learning and statistical analysis give us the tools and algorithms for inducing such a causal blueprint from a combination of passive observation and active intervention.
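As a rough illustration of what an active experiment on one stage might look like, the sketch below randomly assigns half of the nodes in a simulated stage to an intervention and estimates its effect on task completion time. Everything in it is an assumption made for illustration and is not taken from the paper: the `run_task` simulator, the throttling intervention, the 20-second effect size, and the choice of a difference in means with a permutation test.

```python
import random
import statistics


def run_task(throttled, rng):
    """Hypothetical stand-in for one task on one node; returns seconds."""
    base = rng.gauss(100.0, 5.0)                 # assumed per-node noise
    return base + (20.0 if throttled else 0.0)   # assumed true effect: +20 s


def randomized_experiment(num_nodes=200, seed=0):
    """Randomly intervene on half the nodes of a stage and time every task."""
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    treated = set(nodes[: num_nodes // 2])       # random assignment
    treated_times, control_times = [], []
    for node in range(num_nodes):
        seconds = run_task(node in treated, rng)
        (treated_times if node in treated else control_times).append(seconds)
    return treated_times, control_times


def effect_estimate(treated_times, control_times):
    """Difference in mean completion time between the two groups."""
    return statistics.mean(treated_times) - statistics.mean(control_times)


def permutation_p_value(treated_times, control_times, trials=2000, seed=1):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(effect_estimate(treated_times, control_times))
    pooled = treated_times + control_times       # new list; inputs untouched
    k = len(treated_times)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:k]) - statistics.mean(pooled[k:])
        if abs(diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


treated, control = randomized_experiment()
effect = effect_estimate(treated, control)
p = permutation_p_value(treated, control)
print(f"estimated effect: {effect:.1f} s, p = {p:.4f}")
```

Because the intervention is assigned at random, a difference between the two groups can be attributed to the intervention rather than to pre-existing differences among nodes, which is what distinguishes this kind of experiment from passive observation. In a real cluster, `run_task` would be replaced by actual task executions under controlled conditions.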