Root cause detection in a service-oriented architecture

Authors:
Myunghwan Kim;Roshan Sumbaly;Sam Shah
Affiliations:
Stanford University, Stanford, USA;LinkedIn Corporation, Mountain View, USA;LinkedIn Corporation, Mountain View, USA
Venue:
Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
Year:
2013

Abstract

Large-scale websites are predominantly built as a service-oriented architecture. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user's request. An anomalous change in a metric of one service can propagate to other services during this communication, resulting in overall degradation of the request. As any such degradation is revenue-impacting, maintaining correct functionality is of paramount concern: it is important to find the root cause of any anomaly as quickly as possible. This is challenging because there are numerous metrics or sensors for a given service, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers. This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors, to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
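
The abstract names the inputs (per-sensor time-series metrics and a call graph) and the output (a ranked list of candidate root causes), but not the model's internals. The following is a minimal illustrative sketch, not the authors' implementation: it assumes that candidates can be scored by a personalized-PageRank-style random walk over the call graph, with each step biased toward sensors whose metric correlates with the anomalous one. The function names, the sensor names, the damping value, and the use of Pearson correlation are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_root_causes(call_graph, metrics, frontend, damping=0.85, iters=100):
    """Rank sensors with a correlation-weighted random walk (hypothetical scheme).

    call_graph: dict mapping a sensor to the list of sensors it calls.
    metrics:    dict mapping every sensor to its metric series during the anomaly.
    frontend:   the sensor on which the anomaly was observed.
    """
    nodes = list(metrics)
    # Similarity of each sensor's metric to the anomalous frontend metric.
    sim = {v: abs(pearson(metrics[v], metrics[frontend])) for v in nodes}
    candidates = [v for v in nodes if v != frontend]

    # Teleport distribution: proportional to similarity, never to the frontend.
    total = sum(sim[v] for v in candidates)
    teleport = {v: 0.0 for v in nodes}
    for v in candidates:
        teleport[v] = sim[v] / total if total > 0 else 1.0 / len(candidates)

    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) * teleport[v] for v in nodes}
        for u in nodes:
            callees = call_graph.get(u, [])
            weight = sum(sim[c] for c in callees)
            if weight == 0:
                # No (correlated) callees: restart according to the teleport vector.
                for v in nodes:
                    nxt[v] += damping * score[u] * teleport[v]
            else:
                # Follow calls in proportion to the callee's similarity.
                for c in callees:
                    nxt[c] += damping * score[u] * sim[c] / weight
        score = nxt
    return sorted(candidates, key=lambda v: score[v], reverse=True)


# Made-up latency series: the frontend, service_a, and db spike together.
call_graph = {"frontend": ["service_a", "service_b"],
              "service_a": ["db"],
              "service_b": ["cache"]}
metrics = {"frontend": [10, 10, 10, 50, 60, 55],
           "service_a": [8, 8, 8, 40, 48, 44],
           "service_b": [7, 7, 8, 7, 7, 8],
           "db": [5, 5, 5, 30, 35, 33],
           "cache": [1, 1, 1, 1, 1, 1]}
print(rank_root_causes(call_graph, metrics, "frontend"))
```

In this toy data the walk concentrates on the correlated call path, so db and service_a rank ahead of the uncorrelated service_b and cache. A scheme like this favors the deepest correlated dependency, which is a design choice of the sketch rather than a claim about MonitorRank.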