Performance debugging for distributed systems of black boxes

Authors:
Marcos K. Aguilera;Jeffrey C. Mogul;Janet L. Wiener;Patrick Reynolds;Athicha Muthitacharoen
Affiliations:
HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;Duke University;MIT Lab for Computer Science, MA
Venue:
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Year:
2003

Citing 13
Cited 192

Automated packet trace analysis of TCP implementations

SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
BLT: Bi-layer tracing of HTTP and TCP/IP

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
An open graph visualization system and its applications to software engineering

Software—Practice & Experience - Special issue on discrete algorithm engineering
A non-intrusive, wavelet-based approach to detecting network performance problems

IMW '01 Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement
DPM: A Measurement System for Distributed Programs

IEEE Transactions on Computers
Learning Stochastic Regular Grammars by Means of a State Merging Method

ICGI '94 Proceedings of the Second International Colloquium on Grammatical Inference and Applications
Automatic Generation of a Software Performance Model Using an Object-Oriented Prototype

MASCOTS '95 Proceedings of the 3rd International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Gprof: A call graph execution profiler

SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
The NetLogger Methodology for High Performance Distributed Systems Performance Analysis

HPDC '98 Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing
ETE: A Customizable Approach to Measuring End-to-End Response Times and Their Components in Distributed Systems

ICDCS '99 Proceedings of the 19th IEEE International Conference on Distributed Computing Systems
Using runtime paths for macroanalysis

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Detecting stepping stones

SSYM'00 Proceedings of the 9th conference on USENIX Security Symposium - Volume 9

User-level internet path diagnosis

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
TraceBack: first fault diagnosis by reconstruction of distributed control flow

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
IMPuLSE: integrated monitoring and profiling for large-scale environments

LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Deconstructing Commodity Storage Clusters

Proceedings of the 32nd annual international symposium on Computer Architecture
Failure detection and localization in component based systems by online tracking

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
Daphne: performance debugging using model-driven anomaly characterization

Proceedings of the twentieth ACM symposium on Operating systems principles
Web services navigator: visualizing the execution of web services

IBM Systems Journal
Autonomous recovery in componentized Internet applications

Cluster Computing
Utilification

Proceedings of the 11th workshop on ACM SIGOPS European workshop
Request extraction in Magpie: events, schemas and temporal joins

Proceedings of the 11th workshop on ACM SIGOPS European workshop
WAP5: black-box performance debugging for wide-area systems

Proceedings of the 15th international conference on World Wide Web
Automated Online Monitoring of Distributed Applications through External Monitors

IEEE Transactions on Dependable and Secure Computing
Challenges in managing dependable data systems

ACM SIGMETRICS Performance Evaluation Review - Design, implementation, and performance of storage systems
Stardust: tracking activity in a distributed storage system

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Architecture-driven platform independent deterministic replay for distributed hard real-time systems

Proceedings of the ISSTA 2006 workshop on Role of software architecture for testing and analysis
Execution patterns for visualizing web services

SoftVis '06 Proceedings of the 2006 ACM symposium on Software visualization
Mapping moving landscapes by mining mountains of logs: novel techniques for dependency model generation

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Towards self-predicting systems: What if you could ask ‘what-if’?

The Knowledge Engineering Review
Modeling and Tracking of Transaction Flow Dynamics for Fault Detection in Complex Systems

IEEE Transactions on Dependable and Secure Computing
Automated known problem diagnosis with event traces

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Using queries for distributed monitoring and forensics

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
TestEJB: response time measurement and call dependency tracing for EJBs

MAI '07 Proceedings of the 1st workshop on Middleware-application interaction: in conjunction with Euro-Sys 2007
Performance problem localization in self-healing, service-oriented systems using Bayesian networks

Proceedings of the 2007 ACM symposium on Applied computing
Improved error reporting for software that uses black-box components

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Mace: language support for building distributed systems

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Comprehensive depiction of configuration-dependent performance anomalies in distributed server systems

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
I/O system performance debugging using model-driven anomaly characterization

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Three research challenges at the intersection of machine learning, statistical induction, and systems

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Causeway: operating system support for controlling and analyzing the execution of distributed programs

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Configuration debugging as search: finding the needle in the haystack

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Correlating instrumentation data to system states: a building block for automated diagnosis and control

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Automatic misconfiguration troubleshooting with peerpressure

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Detecting performance anomalies in global applications

WORLDS'05 Proceedings of the 2nd conference on Real, Large Distributed Systems - Volume 2
Whodunit: transactional profiling for multi-tier applications

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Towards highly reliable enterprise network services via inference of multi-level dependencies

Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
Observer: keeping system models from becoming obsolete

HotAC II Hot Topics in Autonomic Computing on Hot Topics in Autonomic Computing
AjaxScope: a platform for remotely monitoring the client-side behavior of web 2.0 applications

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Triage: diagnosing production run failures at the user's site

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
/*icomment: bugs or bad comments?*/

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
AutoBash: improving configuration management with operating system causality analysis

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Operating system profiling via latency analysis

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Automated Rule-Based Diagnosis through a Distributed Monitor System

IEEE Transactions on Dependable and Secure Computing
Architecture-driven diagnosis of performance failures in a token ring

HotDep'07 Proceedings of the 3rd workshop on Hot Topics in System Dependability
eMIVA: tool support for the instrumentation of critical distributed applications

ACM SIGMETRICS Performance Evaluation Review
Hardware counter driven on-the-fly request signatures

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Understanding and visualizing full systems with data flow tomography

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
CAMP: a common API for measuring performance

LISA'07 Proceedings of the 21st conference on Large Installation System Administration Conference
Hang analysis: fighting responsiveness bugs

Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
BorderPatrol: isolating events for black-box tracing

Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Live monitoring: using adaptive instrumentation and analysis to debug and maintain web applications

HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
Why did my pc suddenly slow down?

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SPIKE: best practice generation for storage area networks

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
Fingerpointing correlated failures in replicated systems

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Scalable problem localization for distributed systems: principles and practices

Proceedings of the 2nd international conference on Scalable information systems
Tracking in a spaghetti bowl: monitoring transactions using footprints

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Ironmodel: robust performance models in the wild
DieCast: testing distributed systems with an accurate scale model

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
D3S: debugging deployed distributed systems

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Remote profiling of resource constraints of web servers using mini-flash crowds

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
A dollar from 15 cents: cross-platform management for internet services

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Using causality to diagnose configuration bugs

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Towards realistic file-system benchmarks with CodeMRI

ACM SIGMETRICS Performance Evaluation Review
Dustminer: troubleshooting interactive complexity bugs in sensor networks

Proceedings of the 6th ACM conference on Embedded network sensor systems
A model of process documentation to determine provenance in mash-ups

ACM Transactions on Internet Technology (TOIT)
Galapagos: model-driven discovery of end-to-end application-storage relationships in distributed systems

IBM Journal of Research and Development
Active Diagnosis of High-Level Faults in Distributed Internet Services

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
The Study of Multi-agent Network Flow Architecture for Application Performance Evaluation

KES-AMSTA '07 Proceedings of the 1st KES International Symposium on Agent and Multi-Agent Systems: Technologies and Applications
Dynamic dependencies and performance improvement

LISA'08 Proceedings of the 22nd conference on Large installation system administration conference
Diagnosing distributed systems with self-propelled instrumentation

Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
Causeway: support for controlling and analyzing the execution of multi-tier applications

Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware
Isolation points: Creating performance-robust enterprise systems

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
mBrace: action-based performance monitoring of multi-tier web applications

Proceedings of the Third Workshop on Dependable Distributed Data Management
Understanding customer problem troubleshooting from storage system logs

FAST '09 Proceedings of the 7th conference on File and storage technologies
DIADS: addressing the "my-problem-or-yours" syndrome with integrated SAN and database diagnosis

FAST '09 Proceedings of the 7th conference on File and storage technologies
Configuration-space performance anomaly depiction

LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Ranking the importance of alerts for problem determination in large computer systems

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
AdaptGuard: guarding adaptive systems from instability

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
End-to-end performance forecasting: finding bottlenecks before they happen

Proceedings of the 36th annual international symposium on Computer architecture
RPC chains: efficient client-server communication in geodistributed systems

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Finding Symbolic Bug Patterns in Sensor Networks

DCOSS '09 Proceedings of the 5th IEEE International Conference on Distributed Computing in Sensor Systems
Troubleshooting thousands of jobs on production grids using data mining techniques

GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
Automated anomaly detection and performance modeling of enterprise applications

ACM Transactions on Computer Systems (TOCS)
How to keep your head above water while detecting errors

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Stateful error detection in high throughput applications

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Macroscope: end-point approach to networked application dependency discovery

Proceedings of the 5th international conference on Emerging networking experiments and technologies
EbAT: online methods for detecting utility cloud anomalies

Proceedings of the 6th Middleware Doctoral Symposium
Managing massive time series streams with multi-scale compressed trickles

Proceedings of the VLDB Endowment
Performance debugging in data centers: doing more with less

COMSNETS'09 Proceedings of the First international conference on COMmunication Systems And NETworks
Ganesha: blackBox diagnosis of MapReduce systems

ACM SIGMETRICS Performance Evaluation Review
Do you know your IQ?: a research agenda for information quality in systems

ACM SIGMETRICS Performance Evaluation Review
SherLog: error diagnosis by connecting clues from run-time logs

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Run-time dependency tracking in data centers

Proceedings of the Third Annual ACM Bangalore Conference
SNTS: sensor network troubleshooting suite

DCOSS'07 Proceedings of the 3rd IEEE international conference on Distributed computing in sensor systems
Passive inspection of sensor networks

DCOSS'07 Proceedings of the 3rd IEEE international conference on Distributed computing in sensor systems
Towards versatile performance models for complex, popular applications

ACM SIGMETRICS Performance Evaluation Review
Are clouds ready for large distributed applications?

ACM SIGOPS Operating Systems Review
CloudXplor: a tool for configuration planning in clouds based on empirical data

Proceedings of the 2010 ACM Symposium on Applied Computing
Assessing operational impact in enterprise systems by mining usage patterns

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems

Proceedings of the 7th international conference on Autonomic computing
A query language for understanding component interactions in production systems

Proceedings of the 24th ACM International Conference on Supercomputing
Practical performance models for complex, popular applications

Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
How to keep your head above water while detecting errors

Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
FlowRank: ranking NetFlow records

Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
Measurement and diagnosis of address misconfigured P2P traffic

INFOCOM'10 Proceedings of the 29th conference on Information communications
Adaptive system anomaly prediction for large-scale hosting infrastructures

Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications

ACM Transactions on the Web (TWEB)
Cloudward bound: planning for beneficial migration of enterprise applications to the cloud

Proceedings of the ACM SIGCOMM 2010 conference
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Mochi: visual log-analysis based tools for debugging hadoop

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Towards automatic inference of task hierarchies in complex systems

HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Automating network application dependency discovery: experiences, limitations, and new solutions

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
vPath: precise discovery of request processing paths from black-box observations of thread and network activities

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
CLUEBOX: a performance log analyzer for automated troubleshooting

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
SALSA: analyzing logs as state machines

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Look who's talking: discovering dependencies between virtual machines using CPU utilization

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Experiences with tracing causality in networked services

INM/WREN'10 Proceedings of the 2010 internet network management conference on Research on enterprise networking
Diagnosing mobile applications in the wild

Hotnets-IX Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks
Performance analysis of idle programs

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Mining netflow records for critical network activities

AIMS'10 Proceedings of the Mechanisms for autonomous management of networks and services, and 4th international conference on Autonomous infrastructure, management and security
FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Behavior-based problem localization for parallel file systems

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Automating configuration troubleshooting with dynamic information flow analysis

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Visual debugging for stream processing applications

RV'10 Proceedings of the First international conference on Runtime verification
Improving software diagnosability via log enhancement

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
MT-WAVE: profiling multi-tier web applications

Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
DieCast: Testing Distributed Systems with an Accurate Scale Model

ACM Transactions on Computer Systems (TOCS)
QoSaaS: quality of service as a service

Hot-ICE'11 Proceedings of the 11th USENIX conference on Hot topics in management of internet, cloud, and enterprise networks and services
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Comprehensive depiction of configuration-dependent performance anomalies in distributed server systems

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
WiDS checker: combating bugs in distributed systems

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
X-trace: a pervasive network tracing framework

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Friday: global comprehension for distributed replay

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
ASDF: an automated, online framework for diagnosing performance problems

Architecting dependable systems VII
LIME: a framework for debugging load imbalance in multi-threaded execution

Proceedings of the 33rd International Conference on Software Engineering
Rake: semantics assisted network-based tracing framework

Proceedings of the Nineteenth International Workshop on Quality of Service
A flexible architecture integrating monitoring and analytics for managing large-scale data centers

Proceedings of the 8th ACM international conference on Autonomic computing
G2: a graph processing system for diagnosing distributed systems

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
OFRewind: enabling record and replay troubleshooting for networks

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
BotTrack: tracking botnets using NetFlow and PageRank

NETWORKING'11 Proceedings of the 10th international IFIP TC 6 conference on Networking - Volume Part I
Ranking the importance of alerts for problem determination in large computer systems

Cluster Computing
PAL: Propagation-aware Anomaly Localization for cloud hosted distributed applications

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
BLR-D: applying bilinear logistic regression to factored diagnosis problems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
PipeCloud: using causality to overcome speed-of-light delays in cloud-based disaster recovery

Proceedings of the 2nd ACM Symposium on Cloud Computing
Using link gradients to predict the impact of network latency on multitier applications

IEEE/ACM Transactions on Networking (TON)
Bootstrapping energy debugging on smartphones: a first look at energy bugs in mobile devices

Proceedings of the 10th ACM Workshop on Hot Topics in Networks
Fast extraction of adaptive change point based patterns for problem resolution in enterprise systems

DSOM'06 Proceedings of the 17th IFIP/IEEE international conference on Distributed Systems: operations and management
BLR-D: applying bilinear logistic regression to factored diagnosis problems

ACM SIGOPS Operating Systems Review
Improving Software Diagnosability via Log Enhancement

ACM Transactions on Computer Systems (TOCS) - Special Issue ASPLOS 2011
Dataflow Tomography: Information Flow Tracking For Understanding and Visualizing Full Systems

ACM Transactions on Architecture and Code Optimization (TACO)
Modeling the parallel execution of black-box services

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Causeway: support for controlling and analyzing the execution of multi-tier applications

Middleware'05 Proceedings of the ACM/IFIP/USENIX 6th international conference on Middleware
Modellus: Automated modeling of complex internet data center applications

ACM Transactions on the Web (TWEB)
DAPA: diagnosing application performance anomalies for virtualized infrastructures

Hot-ICE'12 Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services
Structured comparative analysis of systems logs to diagnose performance problems

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Understanding and detecting real-world performance bugs

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Automatically finding performance problems with feedback-directed learning software testing

Proceedings of the 34th International Conference on Software Engineering
What is my program doing? program dynamics in programmer's terms

RV'11 Proceedings of the Second international conference on Runtime verification
NetPilot: automating datacenter network failure mitigation

Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Net-cohort: detecting and managing VM ensembles in virtualized data centers

Proceedings of the 9th international conference on Autonomic computing
PowerTracer: tracing requests in multi-tier services to diagnose energy inefficiency

Proceedings of the 9th international conference on Autonomic computing
NetPilot: automating datacenter network failure mitigation

ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12
Collaborative energy debugging for mobile devices

HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
AppInsight: mobile app performance monitoring in the wild

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Be conservative: enhancing failure diagnosis with proactive logging

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Online black-box failure prediction for mission critical distributed systems

SAFECOMP'12 Proceedings of the 31st international conference on Computer Safety, Reliability, and Security
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems

International Journal of Adaptive, Resilient and Autonomic Systems
VScope: middleware for troubleshooting time-sensitive data center applications

Proceedings of the 13th International Middleware Conference
A study of unpredictability in fault-tolerant middleware

Computer Networks: The International Journal of Computer and Telecommunications Networking
Evaluating MapReduce for profiling application traffic

Proceedings of the first edition workshop on High performance and programmable networking
vPerfGuard: an automated model-driven framework for application performance diagnosis in consolidated cloud environments

Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Analytical modeling for what-if analysis in complex cloud computing applications

ACM SIGMETRICS Performance Evaluation Review
Adaptive monitoring of web-based applications: a performance study

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Juggling the Jigsaw: towards automated problem inference from network trouble tickets

NSDI'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
A characteristic study on failures of production distributed data-parallel programs

Proceedings of the 2013 International Conference on Software Engineering
An online service-oriented performance profiling tool for cloud computing systems

Frontiers of Computer Science: Selected Publications from Chinese Universities
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Carat: collaborative energy diagnosis for mobile devices

Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems
Asynchronous intrusion recovery for interconnected web services

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ROOT: replaying multithreaded traces with resource-oriented ordering

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
On fault resilience of OpenStack

Proceedings of the 4th annual Symposium on Cloud Computing
Troubleshooting interactive complexity bugs in wireless sensor networks using data mining techniques

ACM Transactions on Sensor Networks (TOSN)
Verifiable network function outsourcing: requirements, challenges, and roadmap

Proceedings of the 2013 workshop on Hot topics in middleboxes and network function virtualization
Comprehending performance from real-world execution traces: a device-driver case

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review
Making problem diagnosis work for large-scale, production storage systems

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Seeing through black boxes: Tracking transactions through queues under monitoring resource constraints

Performance Evaluation
NetCheck: network diagnoses from blackbox traces

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Abstract

Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.

We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.
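
The idea of using RPC timing to infer inter-call causality can be illustrated with a small sketch. This is not the paper's algorithm; it is a simplified heuristic written for illustration. It assumes that each traced RPC has already been reduced to a call/return pair with timestamps, that a call issued by node X was caused by the most recently started call into X whose interval encloses it, and that child calls of one parent do not overlap in time. The names `Call`, `infer_nesting`, and `self_delay` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Call:
    """One traced RPC: call message seen at `start`, return message at `end`."""
    src: str
    dst: str
    start: float
    end: float
    parent: Optional["Call"] = None
    children: List["Call"] = field(default_factory=list)


def infer_nesting(calls: List[Call]) -> List[Call]:
    """Guess a parent for each call using timing only (illustrative heuristic).

    A call X->Y is attributed to the most recently started call W->X
    whose [start, end] interval fully encloses it. Calls with no
    enclosing candidate are left as roots.
    """
    for child in calls:
        candidates = [
            p for p in calls
            if p is not child
            and p.dst == child.src
            and p.start <= child.start
            and child.end <= p.end
        ]
        if candidates:
            parent = max(candidates, key=lambda p: p.start)
            child.parent = parent
            parent.children.append(child)
    return calls


def self_delay(call: Call) -> float:
    """Time spent at call.dst itself: total duration minus time in child calls.

    Assumes the child calls are sequential (non-overlapping).
    """
    in_children = sum(c.end - c.start for c in call.children)
    return (call.end - call.start) - in_children
```

Given a trace in which a client calls a web node, which in turn calls a database node and an authentication node, `infer_nesting` would attach both inner calls to the client's call, and `self_delay` on that call would report the time attributable to the web node alone. Real traces contain concurrent requests, so several enclosing candidates are often plausible; handling that ambiguity is beyond this sketch.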