WAP5: black-box performance debugging for wide-area systems

Authors:
Patrick Reynolds;Janet L. Wiener;Jeffrey C. Mogul;Marcos K. Aguilera;Amin Vahdat
Affiliations:
Duke University;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;HP Labs, Palo Alto, CA;UC San Diego
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006

Citing 18
Cited 28

Interposition agents: transparently interposing user code at the system interface

SOSP '93 Proceedings of the fourteenth ACM symposium on Operating systems principles
On calibrating measurements of packet transit times

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Probability and statistics with reliability, queuing and computer science applications

Probability and statistics with reliability, queuing and computer science applications
SEDA: an architecture for well-conditioned, scalable internet services

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
The Paradyn Parallel Performance Measurement Tool

Computer
DPM: A Measurement System for Distributed Programs

IEEE Transactions on Computers
The NetLogger Methodology for High Performance Distributed Systems Performance Analysis

HPDC '98 Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing
ETE: A Customizable Approach to Measuring End-to-End Response Times and Their Components in Distributed Systems

ICDCS '99 Proceedings of the 19th IEEE International Conference on Distributed Computing Systems
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Scalability and accuracy in a large-scale network emulator

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Reliability and security in the CoDeeN content distribution network

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Designing a DHT for low latency and high throughput

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Democratizing content publication with coral

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Operating system support for planetary-scale network services

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Path-based faliure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Pip: detecting the unexpected in distributed systems

NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Transparent result caching

ATEC '98 Proceedings of the annual conference on USENIX Annual Technical Conference

SC2D: an alternative to trace anonymization

Proceedings of the 2006 SIGCOMM workshop on Mining network data
Whodunit: transactional profiling for multi-tier applications

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Towards highly reliable enterprise network services via inference of multi-level dependencies

Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
BorderPatrol: isolating events for black-box tracing

Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Dynamic dependencies and performance improvement

LISA'08 Proceedings of the 22nd conference on Large installation system administration conference
Change is hard: adapting dependency graph models for unified diagnosis in wired/wireless networks

Proceedings of the 1st ACM workshop on Research on enterprise networking
Troubleshooting thousands of jobs on production grids using data mining techniques

GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
How to keep your head above water while detecting errors

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Macroscope: end-point approach to networked application dependency discovery

Proceedings of the 5th international conference on Emerging networking experiments and technologies
A query language for understanding component interactions in production systems

Proceedings of the 24th ACM International Conference on Supercomputing
How to keep your head above water while detecting errors

Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
FlowRank: ranking NetFlow records

Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
Automating network application dependency discovery: experiences, limitations, and new solutions

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
vPath: precise discovery of request processing paths from black-box observations of thread and network activities

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Look who's talking: discovering dependencies between virtual machines using CPU utilization

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Mining netflow records for critical network activities

AIMS'10 Proceedings of the Mechanisms for autonomous management of networks and services, and 4th international conference on Autonomous infrastructure, management and security
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Rake: semantics assisted network-based tracing framework

Proceedings of the Nineteenth International Workshop on Quality of Service
OFRewind: enabling record and replay troubleshooting for networks

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
BotTrack: tracking botnets using NetFlow and PageRank

NETWORKING'11 Proceedings of the 10th international IFIP TC 6 conference on Networking - Volume Part I
Using link gradients to predict the impact of network latency on multitier applications

IEEE/ACM Transactions on Networking (TON)
NetPilot: automating datacenter network failure mitigation

Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Net-cohort: detecting and managing VM ensembles in virtualized data centers

Proceedings of the 9th international conference on Autonomic computing
UBL: unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems

Proceedings of the 9th international conference on Autonomic computing
NetPilot: automating datacenter network failure mitigation

ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12
An online service-oriented performance profiling tool for cloud computing systems

Frontiers of Computer Science: Selected Publications from Chinese Universities
Carat: collaborative energy diagnosis for mobile devices

Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems
On fault resilience of OpenStack

Proceedings of the 4th annual Symposium on Cloud Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Wide-area distributed applications are challenging to debug, optimize, and maintain. We present Wide-Area Project 5 (WAP5), which aims to make these tasks easier by exposing the causal structure of communication within an application and by exposing delays that imply bottlenecks. These bottlenecks might not otherwise be obvious, with or without the application's source code. Previous research projects have presented algorithms to reconstruct application structure and the corresponding timing information from black-box message traces of local-area systems. In this paper we present (1) a new algorithm for reconstructing application structure in both local- and wide-area distributed systems, (2) an infrastructure for gathering application traces in PlanetLab, and (3) our experiences tracing and analyzing three systems: CoDeeN and Coral, two content-distribution networks in PlanetLab; and Slurpee, an enterprise-scale incident-monitoring system.