Detecting application-level failures in component-based Internet services

Authors:
E. Kiciman;A. Fox
Affiliations:
Dept. of Comput. Sci., Stanford Univ., CA, USA;-
Venue:
IEEE Transactions on Neural Networks
Year:
2005

Citing 0
Cited 47

Capturing, indexing, clustering, and retrieving system history

Proceedings of the twentieth ACM symposium on Operating systems principles
Sympathy for the sensor network debugger

Proceedings of the 3rd international conference on Embedded networked sensor systems
Towards a debugging system for sensor networks

International Journal of Network Management
Autonomous recovery in componentized Internet applications

Cluster Computing
Modeling and Tracking of Transaction Flow Dynamics for Fault Detection in Complex Systems

IEEE Transactions on Dependable and Secure Computing
Detecting performance anomalies in global applications

WORLDS'05 Proceedings of the 2nd conference on Real, Large Distributed Systems - Volume 2
Towards fingerpointing in the Emulab dynamic distributed system

WORLDS'06 Proceedings of the 3rd conference on USENIX Workshop on Real, Large Distributed Systems - Volume 3
Architecture-driven diagnosis of performance failures in a token ring

HotDep'07 Proceedings of the 3rd workshop on Hot Topics in System Dependability
Why did my pc suddenly slow down?

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
Fingerpointing correlated failures in replicated systems

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
SWEEPER: an efficient disaster recovery point identification mechanism

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Monitoring multi-tier clustered systems with invariant metric relationships

Proceedings of the 2008 international workshop on Software engineering for adaptive and self-managing systems
Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis

DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
Network-Wide Rollback Scheme for Fast Recovery from Operator Errors Toward Dependable Network

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
Diagnosing distributed systems with self-propelled instrumentation

Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
System monitoring with metric-correlation models: problems and solutions

ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Suelo: human-assisted sensing for exploratory soil monitoring studies

Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems
How to keep your head above water while detecting errors

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
A survey of online failure prediction methods

ACM Computing Surveys (CSUR)
Heteroscedastic models to track relationships between management metrics

IM'09 Proceedings of the 11th IFIP/IEEE international conference on Symposium on Integrated Network Management
Ganesha: blackBox diagnosis of MapReduce systems

ACM SIGMETRICS Performance Evaluation Review
Assessing operational impact in enterprise systems by mining usage patterns

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
A statistical approach to detect application-level failures in internet services

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 5
Improving wide-area distributed system availability

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2
Adaptive system anomaly prediction for large-scale hosting infrastructures

Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Mochi: visual log-analysis based tools for debugging hadoop

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
A case for machine learning to optimize multicore performance

HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
Detecting user-visible failures in AJAX web applications by analyzing users' interaction behaviors

Proceedings of the IEEE/ACM international conference on Automated software engineering
Behavior-based problem localization for parallel file systems

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Analyzing web logs to detect user-visible failures

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Leveraging many simple statistical models to adaptively monitor software systems

International Journal of High Performance Computing and Networking
Software error early detection system based on run-time statistical analysis of function return values

HotACI'06 Proceedings of the First international conference on Hot topics in autonomic computing
A root cause localization model for large scale systems

HotDep'05 Proceedings of the First conference on Hot topics in system dependability
ASDF: an automated, online framework for diagnosing performance problems

Architecting dependable systems VII
Self-adaptive software system monitoring for performance anomaly localization

Proceedings of the 8th ACM international conference on Autonomic computing
Practical experiences with chronics discovery in large telecommunications systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
A self-adaptive monitoring framework for component-based software systems

ECSA'11 Proceedings of the 5th European conference on Software architecture
Practical experiences with chronics discovery in large telecommunications systems

ACM SIGOPS Operating Systems Review
Diagnosis of software failures using computational geometry

ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
3-Dimensional root cause diagnosis via co-analysis

Proceedings of the 9th international conference on Autonomic computing
Light-weight black-box failure detection for distributed systems

Proceedings of the 2012 workshop on Management of big data systems
Leveraging many simple statistical models to adaptively monitor software systems

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review
Making problem diagnosis work for large-scale, production storage systems

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Workload-aware anomaly detection for Web applications

Journal of Systems and Software

Abstract

Most Internet services (e-commerce, search engines, etc.) suffer faults. Quickly detecting these faults can be the largest bottleneck in improving availability of the system. We present Pinpoint, a methodology for automating fault detection in Internet services by: 1) observing low-level internal structural behaviors of the service; 2) modeling the majority behavior of the system as correct; and 3) detecting anomalies in these behaviors as possible symptoms of failures. Without requiring any a priori application-specific information, Pinpoint correctly detected 89%-96% of major failures in our experiments, as compared with 20%-70% detected by current application-generic techniques.
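
The three steps named in the abstract (observe internal behavior, treat the majority behavior as correct, flag deviations) can be illustrated with a minimal sketch. This is not Pinpoint's actual algorithm, which the abstract does not specify. It assumes a hypothetical setting in which each instance of a component logs the names of the components it calls, uses the per-callee median across instances as a stand-in for "majority behavior", and scores each instance by its L1 distance from that reference. All names (`interaction_profile`, `majority_profile`, `detect`, the `cart-*` instances) and the threshold of 0.5 are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def interaction_profile(calls):
    """Step 1: summarize observed calls as a normalized frequency per callee."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {callee: n / total for callee, n in counts.items()} if total else {}


def majority_profile(profiles):
    """Step 2: per-callee median across instances, assumed to be 'correct'."""
    callees = set().union(*profiles.values())
    return {c: median(p.get(c, 0.0) for p in profiles.values()) for c in callees}


def anomaly_scores(observed):
    """L1 distance of each instance's profile from the majority profile."""
    profiles = {inst: interaction_profile(calls) for inst, calls in observed.items()}
    reference = majority_profile(profiles)
    return {
        inst: sum(abs(p.get(c, 0.0) - reference[c]) for c in reference)
        for inst, p in profiles.items()
    }


def detect(observed, threshold=0.5):
    """Step 3: report instances whose behavior deviates beyond the threshold."""
    scores = anomaly_scores(observed)
    return sorted(inst for inst, score in scores.items() if score > threshold)


# Hypothetical observations: callee names logged by four replicas of one component.
observed = {
    "cart-1": ["db", "db", "inventory", "db"],
    "cart-2": ["db", "inventory", "db", "db"],
    "cart-3": ["db", "db", "db", "inventory"],
    "cart-4": ["error-page", "error-page", "db", "error-page"],
}

print(detect(observed))
```

No application-specific rule identifies `cart-4` as faulty here; it is flagged only because its call pattern differs from that of its peers. A fault that affects most instances in the same way would shift the reference itself and go unnoticed in a sketch like this one.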