Automated diagnosis without predictability is a recipe for failure

Authors:
Raja R. Sambasivan;Gregory R. Ganger
Affiliations:
Carnegie Mellon University;Carnegie Mellon University
Venue:
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Year:
2012

Citing 27
Cited 0

The Vision of Autonomic Computing

Computer
Robustness in Complex Systems

HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Fail-Stutter Fault Tolerance

HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,

Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Stardust: tracking activity in a distributed storage system

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Emergent (mis)behavior vs. complex software systems

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Ursa minor: versatile cluster-based storage

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Path-based faliure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Argon: performance insulation for shared storage servers

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Towards highly reliable enterprise network services via inference of multi-level dependencies

Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
DARC: dynamic analysis of root causes of latency distributions

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Reference-driven performance anomaly identification

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Detailed diagnosis in enterprise networks

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Behavior-based problem localization for parallel file systems

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
mClock: handling throughput variability for hypervisor IO scheduling

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Scale and concurrency of GIGA+: file system directories with millions of files

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Dominant resource fairness: fair allocation of multiple resource types

Proceedings of the 8th USENIX conference on Networked systems design and implementation
X-trace: a pervasive network tracing framework

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Disks are like snowflakes: no two are alike

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Warehouse-Scale Computing: Entering the Teenage Decade

Proceedings of the 38th annual international symposium on Computer architecture
Jockey: guaranteed job latency in data parallel clusters

Proceedings of the 7th ACM european conference on Computer Systems
Structured comparative analysis of systems logs to diagnose performance problems

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation

Quantified Score

Hi-index	0.00

Visualization

Abstract

Automated management is critical to the success of cloud computing, given its scale and complexity. But, most systems do not satisfy one of the key properties required for automation: predictability, which in turn relies upon low variance. Most automation tools are not effective when variance is consistently high. Using automated performance diagnosis as a concrete example, this position paper argues that for automation to become a reality, system builders must treat variance as an important metric and make conscious decisions about where to reduce it. To help with this task, we describe a framework for reasoning about sources of variance in distributed systems and describe an example tool for helping identify them.