Most Internet services (e-commerce, search engines, etc.) suffer faults. Detecting these faults quickly can be the largest bottleneck in improving the availability of the system. We present Pinpoint, a methodology for automating fault detection in Internet services by: 1) observing low-level internal structural behaviors of the service; 2) modeling the majority behavior of the system as correct; and 3) detecting anomalies in these behaviors as possible symptoms of failures. Without requiring any a priori application-specific information, Pinpoint correctly detected 89%-96% of major failures in our experiments, compared with 20%-70% detected by current application-generic techniques.
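The abstract does not specify which models Pinpoint uses. The sketch below is a rough, hypothetical illustration of steps 2 and 3 only. It assumes each request is traced as the ordered list of components it passes through (a simplified "structural behavior"). Caller-to-callee transitions that appear in most requests are treated as correct majority behavior. Any request containing a rare or never-seen transition is flagged. The function names `train` and `is_anomalous`, the parameter `min_support`, the 5% default, and the component names are all assumptions made for this example, not Pinpoint's actual design.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# A transition is one caller -> callee edge in a request's component path.
Transition = Tuple[str, str]


def transitions(path: Sequence[str]) -> List[Transition]:
    """Return the consecutive component pairs of one traced request."""
    return list(zip(path, path[1:]))


def train(paths: Sequence[Sequence[str]]) -> Dict[Transition, float]:
    """Estimate the fraction of requests in which each transition occurs.

    Assumption: most observed requests succeeded, so frequent transitions
    stand in for the system's majority (presumed-correct) behavior.
    """
    counts = Counter()
    for p in paths:
        counts.update(set(transitions(p)))  # count each edge once per request
    n = len(paths)
    return {t: c / n for t, c in counts.items()}


def is_anomalous(path: Sequence[str],
                 model: Dict[Transition, float],
                 min_support: float = 0.05) -> bool:
    """Flag a request if any of its transitions is rarer than min_support.

    Transitions never seen during training have support 0.0 and are flagged.
    """
    return any(model.get(t, 0.0) < min_support for t in transitions(path))


# Hypothetical traffic: 90 checkout requests and 10 search requests.
normal = [["web", "cart", "db"]] * 90 + [["web", "search", "index"]] * 10
model = train(normal)
print(is_anomalous(["web", "search", "index"], model))      # False
print(is_anomalous(["web", "cart", "error_page"], model))   # True
```

Treating rarity as suspicion mirrors the "majority is correct" assumption in the abstract. It needs no application-specific labels, but it will also flag legitimate yet uncommon request paths, so a real detector would need a more careful model and threshold.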