Pinpoint: Problem Determination in Large, Dynamic Internet Services

Authors:
Mike Y. Chen;Emre Kiciman;Eugene Fratkin;Armando Fox;Eric Brewer
Venue:
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Year:
2002

Citing 0
Cited 163

Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
User-level internet path diagnosis

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Improving availability with recursive microreboots: a soft-state system case study

Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
MADAPT: managed aspects for dynamic adaptation based on profiling techniques

ARM '04 Proceedings of the 3rd workshop on Adaptive and reflective middleware
Finding and preventing run-time error handling mistakes

OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Cheap recovery: a key to self-managing state

ACM Transactions on Storage (TOS)
Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support

LISA '03 Proceedings of the 17th USENIX conference on System administration
Combining High Level Symptom Descriptions and Low Level State Information for Configuration Fault Diagnosis

LISA '04 Proceedings of the 18th USENIX conference on System administration
Network-Based Problem Detection for Distributed Systems

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
TraceBack: first fault diagnosis by reconstruction of distributed control flow

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
IMPuLSE: integrated monitoring and profiling for large-scale environments

LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Failure detection and localization in component based systems by online tracking

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Short term performance forecasting in enterprise systems

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Dealing with failures during failure recovery of distributed systems

DEAS '05 Proceedings of the 2005 workshop on Design and evolution of autonomic application software
Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
A framework for detecting performance design and deployment antipatterns in component based enterprise systems

DSM '05 Proceedings of the 2nd international doctoral symposium on Middleware
HANet: a framework toward ultimately reliable network services

Journal of Systems and Software
An online evolutionary approach to developing internet services

EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Request extraction in Magpie: events, schemas and temporal joins

Proceedings of the 11th workshop on ACM SIGOPS European workshop
Automated Online Monitoring of Distributed Applications through External Monitors

IEEE Transactions on Dependable and Secure Computing
Stardust: tracking activity in a distributed storage system

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Combining supervised and unsupervised monitoring for fault detection in distributed computing systems

Proceedings of the 2006 ACM symposium on Applied computing
Mapping moving landscapes by mining mountains of logs: novel techniques for dependency model generation

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Towards self-predicting systems: What if you could ask ‘what-if’?

The Knowledge Engineering Review
Problem diagnosis in large-scale computing environments

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Emergent (mis)behavior vs. complex software systems

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Automated known problem diagnosis with event traces

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Performance problem localization in self-healing, service-oriented systems using Bayesian networks

Proceedings of the 2007 ACM symposium on Applied computing
Mace: language support for building distributed systems

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Comprehensive depiction of configuration-dependent performance anomalies in distributed server systems

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
I/O system performance debugging using model-driven anomaly characterization

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Using runtime paths for macroanalysis

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Magpie: online modelling and performance-aware systems

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Causeway: operating system support for controlling and analyzing the execution of distributed programs

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
FUSE: lightweight guaranteed distributed failure notification

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Correlating instrumentation data to system states: a building block for automated diagnosis and control

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Automatic misconfiguration troubleshooting with peerpressure

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
MESO: Supporting Online Decision Making in Autonomic Computing Systems

IEEE Transactions on Knowledge and Data Engineering
Toward recovery-oriented computing

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Triage: diagnosing production run failures at the user's site

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Failure Detection in Large-Scale Internet Services by Principal Subspace Mapping

IEEE Transactions on Knowledge and Data Engineering
Architecture-driven diagnosis of performance failures in a token ring

HotDep'07 Proceedings of the 3rd workshop on Hot Topics in System Dependability
eMIVA: tool support for the instrumentation of critical distributed applications

ACM SIGMETRICS Performance Evaluation Review
Exceptional situations and program reliability

ACM Transactions on Programming Languages and Systems (TOPLAS)
Monitoring High-Dimensional Data for Failure Detection and Localization in Large-Scale Computing Systems

IEEE Transactions on Knowledge and Data Engineering
Vigilant: out-of-band detection of failures in virtual machines

ACM SIGOPS Operating Systems Review
BorderPatrol: isolating events for black-box tracing

Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Investigating statistical machine learning as a tool for software development

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Why did my pc suddenly slow down?

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SPIKE: best practice generation for storage area networks

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
Automatic software interference detection in parallel applications

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Automatic software fault localization using generic program invariants

Proceedings of the 2008 ACM symposium on Applied computing
Scalable problem localization for distributed systems: principles and practices

Proceedings of the 2nd international conference on Scalable information systems
Tracking in a spaghetti bowl: monitoring transactions using footprints

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
DieCast: testing distributed systems with an accurate scale model

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
D3S: debugging deployed distributed systems

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Research and Implementation of Knowledge-Enhanced Information Services

ICSOC '07 Proceedings of the 5th international conference on Service-Oriented Computing
Mining Software Aging Patterns by Artificial Neural Networks

ANNPR '08 Proceedings of the 3rd IAPR workshop on Artificial Neural Networks in Pattern Recognition
Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis

DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
Evolution of storage management: transforming raw data into information

IBM Journal of Research and Development
Galapagos: model-driven discovery of end-to-end application-storage relationships in distributed systems

IBM Journal of Research and Development
Network-Wide Rollback Scheme for Fast Recovery from Operator Errors Toward Dependable Network

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
Active Diagnosis of High-Level Faults in Distributed Internet Services

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
Tracking transaction footprints for non-intrusive end-to-end monitoring

Cluster Computing
Causeway: support for controlling and analyzing the execution of multi-tier applications

Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware
Isolation points: Creating performance-robust enterprise systems

ACM Transactions on Autonomous and Adaptive Systems (TAAS)
Improving the responsiveness of internet services with automatic cache placement

Proceedings of the 4th ACM European conference on Computer systems
Understanding customer problem troubleshooting from storage system logs

FAST '09 Proceedings of the 7th conference on File and storage technologies
AVA: automated interpretation of dynamically detected anomalies

Proceedings of the eighteenth international symposium on Software testing and analysis
Detailed diagnosis in enterprise networks

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Troubleshooting thousands of jobs on production grids using data mining techniques

GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
A practical evaluation of spectrum-based fault localization

Journal of Systems and Software
EbAT: online methods for detecting utility cloud anomalies

Proceedings of the 6th Middleware Doctoral Symposium
A survey of online failure prediction methods

ACM Computing Surveys (CSUR)
Online piece-wise linear approximation of numerical streams with precision guarantees

Proceedings of the VLDB Endowment
Problem classification method to enhance the ITIL incident and problem

IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Performance debugging in data centers: doing more with less

COMSNETS'09 Proceedings of the First international conference on COMmunication Systems And NETworks
Diagnosis of recurrent faults using log files

CASCON '09 Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research
SelfTalk for Dena: query language and runtime support for evaluating system behavior

ACM SIGOPS Operating Systems Review
Dependency detection using a fuzzy engine

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
Assessing operational impact in enterprise systems by mining usage patterns

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
Safe and sound evolution with SONAR: sustainable optimization and navigation with aspects for system-wide reconciliation

Transactions on aspect-oriented software development IV
PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems

Proceedings of the 7th international conference on Autonomic computing
A query language for understanding component interactions in production systems

Proceedings of the 24th ACM International Conference on Supercomputing
A query language and runtime tool for evaluating behavior of multi-tier servers

Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Mochi: visual log-analysis based tools for debugging hadoop

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Volley: automated data placement for geo-distributed cloud services

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
vPath: precise discovery of request processing paths from black-box observations of thread and network activities

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
CLUEBOX: a performance log analyzer for automated troubleshooting

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
SALSA: analyzing logs as state machines

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Look who's talking: discovering dependencies between virtual machines using CPU utilization

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Experiences with tracing causality in networked services

INM/WREN'10 Proceedings of the 2010 internet network management conference on Research on enterprise networking
Diagnosing mobile applications in the wild

Hotnets-IX Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks
Combining hardware and software instrumentation to classify program executions

Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
G-RCA: a generic root cause analysis platform for service quality management in large IP networks

Proceedings of the 6th International COnference
NEVERMIND, the problem is already fixed: proactively detecting and troubleshooting customer DSL problems

Proceedings of the 6th International COnference
Automating configuration troubleshooting with dynamic information flow analysis

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Experience mining Google's production console logs

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Analyzing web logs to detect user-visible failures

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Runtime verification in context: can optimizing error detection improve fault diagnosis?

RV'10 Proceedings of the First international conference on Runtime verification
MT-WAVE: profiling multi-tier web applications

Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
WiDS checker: combating bugs in distributed systems

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
X-trace: a pervasive network tracing framework

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Root-cause analysis of performance anomalies in web-based applications

Proceedings of the 2011 ACM Symposium on Applied Computing
ASDF: an automated, online framework for diagnosing performance problems

Architecting dependable systems VII
Exploiting Service Usage Information for Optimizing Server Resource Management

ACM Transactions on Internet Technology (TOIT)
A flexible architecture integrating monitoring and analytics for managing large-scale data centers

Proceedings of the 8th ACM international conference on Autonomic computing
A model for spectra-based software diagnosis

ACM Transactions on Software Engineering and Methodology (TOSEM)
G2: a graph processing system for diagnosing distributed systems

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
PAL: Propagation-aware Anomaly Localization for cloud hosted distributed applications

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
BLR-D: applying bilinear logistic regression to factored diagnosis problems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Using link gradients to predict the impact of network latency on multitier applications

IEEE/ACM Transactions on Networking (TON)
Fast extraction of adaptive change point based patterns for problem resolution in enterprise systems

DSOM'06 Proceedings of the 17th IFIP/IEEE international conference on Distributed Systems: operations and management
I-queue: smart queues for service management

ICSOC'06 Proceedings of the 4th international conference on Service-Oriented Computing
Assisting failure diagnosis through filesystem instrumentation

Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research
BLR-D: applying bilinear logistic regression to factored diagnosis problems

ACM SIGOPS Operating Systems Review
Exception-Handling bugs in java and a language extension to avoid them

Advanced Topics in Exception Handling Techniques
DejaVu: accelerating resource allocation in virtualized environments

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Measuring the dependability of web services for use in e-science experiments

ISAS'06 Proceedings of the Third international conference on Service Availability
On the specification of non-functional properties of systems by observation

MODELS'09 Proceedings of the 2009 international conference on Models in Software Engineering
Monere: monitoring of service compositions for failure diagnosis

ICSOC'11 Proceedings of the 9th international conference on Service-Oriented Computing
Control considerations for scalable event processing

DSOM'05 Proceedings of the 16th IFIP/IEEE Ambient Networks international conference on Distributed Systems: operations and Management
Infrastructure for automatic dynamic deployment of J2EE applications in distributed environments

CD'05 Proceedings of the Third international working conference on Component Deployment
Automated detection of performance regressions using statistical process control techniques

ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Provenance for system troubleshooting

LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
Program spectra analysis with theory of evidence

Advances in Software Engineering - Special issue on Software Quality Assurance Methodologies and Techniques
DAPA: diagnosing application performance anomalies for virtualized infrastructures

Hot-ICE'12 Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services
Data flow analysis for anomaly detection and identification toward resiliency in extreme scale systems

The Journal of Supercomputing
Net-cohort: detecting and managing VM ensembles in virtualized data centers

Proceedings of the 9th international conference on Autonomic computing
3-Dimensional root cause diagnosis via co-analysis

Proceedings of the 9th international conference on Autonomic computing
Light-weight black-box failure detection for distributed systems

Proceedings of the 2012 workshop on Management of big data systems
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Cooperative component testing architecture in collaborating network environment

ATC'07 Proceedings of the 4th international conference on Autonomic and Trusted Computing
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems

International Journal of Adaptive, Resilient and Autonomic Systems
VScope: middleware for troubleshooting time-sensitive data center applications

Proceedings of the 13th International Middleware Conference
Metamorphic slice: An application in spectrum-based fault localization

Information and Software Technology
Performance problem diagnostics by systematic experimentation

Proceedings of the 18th international doctoral symposium on Components and architecture
Adaptive monitoring of web-based applications: a performance study

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Spectral debugging: how much better can we do?

ACSC '12 Proceedings of the Thirty-fifth Australasian Computer Science Conference - Volume 122
Threats to the validity and value of empirical assessments of the accuracy of coverage-based fault locators

Proceedings of the 2013 International Symposium on Software Testing and Analysis
Using automated program repair for evaluating the effectiveness of fault localization techniques

Proceedings of the 2013 International Symposium on Software Testing and Analysis
Supporting swift reaction: automatically uncovering performance problems by systematic experiments

Proceedings of the 2013 International Conference on Software Engineering
Autonomous, failure-resilient orchestration of distributed discrete event simulations

Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
An online service-oriented performance profiling tool for cloud computing systems

Frontiers of Computer Science: Selected Publications from Chinese Universities
Current challenges in automatic software repair

Software Quality Control
Carat: collaborative energy diagnosis for mobile devices

Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems
A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization

ACM Transactions on Software Engineering and Methodology (TOSEM) - Testing, debugging, and error handling, formal methods, lifecycle concerns, evolution and maintenance
DeepDive: transparently identifying and managing performance interference in virtualized environments

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review
Making problem diagnosis work for large-scale, production storage systems

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Prevalence of coincidental correctness and mitigation of its impact on fault localization

ACM Transactions on Software Engineering and Methodology (TOSEM)
Seeing through black boxes: Tracking transactions through queues under monitoring resource constraints

Performance Evaluation
Slice-based statistical fault localization

Journal of Systems and Software
Component survivability at runtime for mission-critical distributed systems

The Journal of Supercomputing
HSFal: Effective fault localization using hybrid spectrum of full slices and execution slices

Journal of Systems and Software
NetCheck: network diagnoses from blackbox traces

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Abstract

Traditional problem determination techniques rely on static dependency models that are difficult to generate accurately in today's large, distributed, and dynamic application environments, such as e-commerce systems. In this paper, we present a dynamic analysis methodology that automates problem determination in these environments by (1) coarse-grained tagging of numerous real client requests as they travel through the system and (2) using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root cause analysis on the J2EE platform that requires no knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces client requests, a failure detector that uses traffic sniffing and middleware instrumentation, and a data analysis engine. We evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components with high accuracy and produces few false positives.
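
The two steps in the abstract (tag each request with the components it touches, then correlate request outcomes with components) can be illustrated with a small sketch. This is not Pinpoint's analysis engine: the abstract says only that data mining techniques are used, so the per-component failure ratio below is a deliberately simple stand-in. The function name `rank_components`, the trace format, and the component names are all hypothetical.

```python
from collections import defaultdict

def rank_components(traces):
    """Rank components by the fraction of their requests that failed.

    traces: iterable of (components_used, request_failed) pairs, where
    components_used is the set of components one tagged request touched.
    This ratio is a simplified stand-in, not the paper's analysis.
    """
    total = defaultdict(int)   # requests that touched each component
    failed = defaultdict(int)  # failed requests that touched each component
    for components, request_failed in traces:
        for name in set(components):
            total[name] += 1
            if request_failed:
                failed[name] += 1
    scores = {name: failed[name] / total[name] for name in total}
    # Highest failure ratio first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical traces: each request is tagged with the components it used
# and with whether the failure detector believed it failed.
traces = [
    ({"web", "cart", "db"}, True),
    ({"web", "catalog", "db"}, False),
    ({"web", "cart"}, True),
    ({"web", "catalog"}, False),
    ({"web", "search", "db"}, False),
]
ranking = rank_components(traces)
```

In this made-up data, every request that touched `cart` failed, so it ranks first. Shared components such as `web` appear in both failed and successful requests and score lower, which is the intuition behind correlating outcomes across many real requests rather than relying on a static dependency model.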