Path-based failure and evolution management

Authors:
Mike Y. Chen;Anthony Accardi;Emre Kiciman;Jim Lloyd;Dave Patterson;Armando Fox;Eric Brewer
Affiliations:
UC Berkeley, Tellme Networks, Stanford University, eBay Inc.;UC Berkeley, Tellme Networks, Stanford University, eBay Inc.;UC Berkeley, Tellme Networks, Stanford University, eBay Inc.;UC Berkeley, Tellme Networks, Stanford University, eBay Inc.;UC Berkeley, Tellme Networks, Stanford University, eBay Inc.;UC Berkeley, Tellme Networks, Stanford University, eBay Inc.;UC Berkeley, Tellme Networks, Stanford University, eBay Inc.
Venue:
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Year:
2004

Abstract

We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the system as our core abstraction, and our "macro" approach focuses on component interactions rather than the details of the components themselves. Paths record component performance and interactions, are user- and request-centric, and occur in sufficient volume to enable statistical analysis, all in a way that is easily reusable across applications. Automated statistical analysis of multiple paths allows for the detection and diagnosis of complex failures and the assessment of evolution issues. In particular, our approach enables significantly stronger capabilities in failure detection, failure diagnosis, impact analysis, and understanding system evolution. We explore these capabilities with three real implementations, two of which service millions of requests per day. Our contributions include the approach; the maintainable, extensible, and reusable architecture; the various statistical analysis engines; and the discussion of our experience with a high-volume production service over several years.
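
To make the path abstraction concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical schema in which each path lists the components a request visited, with a latency per visit and a success or failure outcome. It then applies one very simple statistic, the fraction of paths through each component that failed, as a stand-in for the statistical analysis engines the abstract mentions. All class and function names here are illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One component visit along a path (hypothetical schema)."""
    component: str
    latency_ms: float


@dataclass
class Path:
    """The components one request touched, in order, plus its outcome."""
    request_id: str
    observations: list = field(default_factory=list)
    failed: bool = False

    def components(self):
        # A set, so a component visited twice counts once per path.
        return {obs.component for obs in self.observations}


def failure_rate_by_component(paths):
    """For each component, the fraction of paths through it that failed."""
    seen = Counter()
    failed = Counter()
    for path in paths:
        for name in path.components():
            seen[name] += 1
            if path.failed:
                failed[name] += 1
    return {name: failed[name] / seen[name] for name in seen}


def rank_suspects(paths):
    """Components ordered from most to least associated with failed paths."""
    rates = failure_rate_by_component(paths)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(rates, key=lambda name: (-rates[name], name))


# Assumed sample data: four requests, two of which failed.
sample_paths = [
    Path("r1", [Observation("frontend", 2.0), Observation("cart", 5.0)]),
    Path("r2", [Observation("frontend", 2.1), Observation("search", 9.0)], failed=True),
    Path("r3", [Observation("frontend", 1.9), Observation("search", 8.5)], failed=True),
    Path("r4", [Observation("frontend", 2.2), Observation("cart", 4.8)]),
]

print(rank_suspects(sample_paths))
```

In this sample, every failed request passed through the hypothetical "search" component, so it ranks first. A production analysis would need far more paths and a proper statistical test before drawing that conclusion, which is why the abstract stresses that paths occur in sufficient volume.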