Diagnosing performance changes by comparing request flows

Authors:
Raja R. Sambasivan;Alice X. Zheng;Michael De Rosa;Elie Krevat;Spencer Whitman;Michael Stroucken;William Wang;Lianghong Xu;Gregory R. Ganger
Affiliations:
Carnegie Mellon University;Microsoft Research;Google;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University
Venue:
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Year:
2011

Citing 27
Cited 20

The visual display of quantitative information

The visual display of quantitative information
Gprof: A call graph execution profiler

SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
prefuse: a toolkit for interactive information visualization

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
WAP5: black-box performance debugging for wide-area systems

Proceedings of the 15th international conference on World Wide Web
Stardust: tracking activity in a distributed storage system

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Pattern Recognition and Machine Learning (Information Science and Statistics)

Pattern Recognition and Machine Learning (Information Science and Statistics)
Emergent (mis)behavior vs. complex software systems

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Dynamic instrumentation of production systems

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Ursa minor: versatile cluster-based storage

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Path-based faliure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Pip: detecting the unexpected in distributed systems

NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Whodunit: transactional profiling for multi-tier applications

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Categorizing and differencing system behaviours

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
Triage: diagnosing production run failures at the user's site

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Ironmodel: robust performance models in the wild
DARC: dynamic analysis of root causes of latency distributions

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Informed data distribution selection in a self-predicting storage system

ICAC '06 Proceedings of the 2006 IEEE International Conference on Autonomic Computing
OptiScope: Performance Accountability for Optimizing Compilers

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Detecting large-scale system problems by mining console logs

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Experiences with tracing causality in networked services

INM/WREN'10 Proceedings of the 2010 internet network management conference on Research on enterprise networking
Bagging, boosting, and C4.S

AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
X-trace: a pervasive network tracing framework

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation

Otus: resource attribution in data-intensive clusters

Proceedings of the second international workshop on MapReduce and its applications
Practical experiences with chronics discovery in large telecommunications systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Practical experiences with chronics discovery in large telecommunications systems

ACM SIGOPS Operating Systems Review
Modeling the parallel execution of black-box services

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Understanding performance modeling for modular mobile-cloud applications

ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Structured comparative analysis of systems logs to diagnose performance problems

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Automated diagnosis without predictability is a recipe for failure

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
3-Dimensional root cause diagnosis via co-analysis

Proceedings of the 9th international conference on Autonomic computing
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Theia: visual signatures for problem diagnosis in large hadoop clusters

lisa'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
Evaluating MapReduce for profiling application traffic

Proceedings of the first edition workshop on High performance and programmable networking
vPerfGuard: an automated model-driven framework for application performance diagnosis in consolidated cloud environments

Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Understanding latency variations of black box services

Proceedings of the 22nd international conference on World Wide Web
An online service-oriented performance profiling tool for cloud computing systems

Frontiers of Computer Science: Selected Publications from Chinese Universities
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
IOFlow: a software-defined storage architecture

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Limplock: understanding the impact of limpware on scale-out cloud systems

Proceedings of the 4th annual Symposium on Cloud Computing
Robust assessment of changes in cellular networks

Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review
NetCheck: network diagnoses from blackbox traces

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Quantified Score

Hi-index	0.00

Visualization

Abstract

The causes of performance changes in a distributed system often elude even its developers. This paper develops a new technique for gaining insight into such changes: comparing request flows from two executions (e.g., of two system versions or time periods). Building on end-to-end request-flow tracing within and across components, algorithms are described for identifying and ranking changes in the flow and/or timing of request processing. The implementation of these algorithms in a tool called Spectroscope is evaluated. Six case studies are presented of using Spectroscope to diagnose performance changes in a distributed storage service caused by code changes, configuration modifications, and component degradations, demonstrating the value and efficacy of comparing request flows. Preliminary experiences of using Spectroscope to diagnose performance changes within select Google services are also presented.