Pip: detecting the unexpected in distributed systems

Authors:
Patrick Reynolds;Charles Killian;Janet L. Wiener;Jeffrey C. Mogul;Mehul A. Shah;Amin Vahdat
Affiliations:
Duke University;University of California, San Diego;Hewlett-Packard Laboratories, Palo Alto;Hewlett-Packard Laboratories, Palo Alto;Hewlett-Packard Laboratories, Palo Alto;University of California, San Diego
Venue:
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Year:
2006

Abstract

Bugs in distributed systems are often hard to find. Many bugs reflect discrepancies between a system's behavior and the programmer's assumptions about that behavior. We present Pip, an infrastructure for comparing actual behavior and expected behavior to expose structural errors and performance problems in distributed systems. Pip allows programmers to express, in a declarative language, expectations about the system's communications structure, timing, and resource consumption. Pip includes system instrumentation and annotation tools to log actual system behavior, and visualization and query tools for exploring expected and unexpected behavior. Pip allows a developer to quickly understand and debug both familiar and unfamiliar systems. We applied Pip to several applications, including FAB, SplitStream, Bullet, and RanSub. We generated most of the instrumentation for all four applications automatically. We found the needed expectations easy to write, starting in each case with automatically generated expectations. Pip found unexpected behavior in each application, and helped to isolate the causes of poor performance and incorrect behavior.
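
The core idea in the abstract, checking logged behavior against declared expectations about structure, timing, and resource use, can be illustrated with a small Python sketch. This is not Pip's expectation language or its trace format, neither of which is shown here. The names `PathInstance`, `Expectation`, and `find_unexpected`, the event strings, and the numeric limits are all hypothetical, chosen only to make the comparison concrete.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PathInstance:
    """One logged request path (hypothetical trace record, not Pip's format)."""
    path_id: str
    events: List[str]   # ordered event names observed on this path
    latency_ms: float   # end-to-end time for the path
    bytes_sent: int     # network bytes attributed to the path


@dataclass
class Expectation:
    """Hypothetical expectation: required event structure plus limits."""
    name: str
    events: List[str]
    max_latency_ms: float
    max_bytes_sent: int

    def violations(self, path: PathInstance) -> List[str]:
        """List which aspects of the expectation this path breaks."""
        problems = []
        if path.events != self.events:
            problems.append("structure")
        if path.latency_ms > self.max_latency_ms:
            problems.append("timing")
        if path.bytes_sent > self.max_bytes_sent:
            problems.append("resources")
        return problems


def find_unexpected(paths: List[PathInstance],
                    expectation: Expectation) -> Dict[str, List[str]]:
    """Map each unexpected path's id to the aspects it violates."""
    report = {}
    for path in paths:
        problems = expectation.violations(path)
        if problems:
            report[path.path_id] = problems
    return report


if __name__ == "__main__":
    read_ok = Expectation(
        name="read",
        events=["recv:request", "task:lookup", "send:reply"],
        max_latency_ms=50.0,
        max_bytes_sent=4096,
    )
    logged = [
        PathInstance("p1", ["recv:request", "task:lookup", "send:reply"], 12.0, 1024),
        PathInstance("p2", ["recv:request", "send:reply"], 80.0, 1024),
    ]
    print(find_unexpected(logged, read_ok))
```

In this sketch a path is unexpected if it fails any one check, so a single report separates structural errors (a missing step) from performance problems (a slow or overly chatty path), which matches the two classes of bugs the abstract names.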