Large-scale websites are predominantly built as service-oriented architectures. In such an architecture, each service is specialized for a certain task, runs on multiple machines, and communicates with other services to serve a user's request. An anomalous change in a metric of one service can propagate to other services during this communication, degrading the request as a whole. Because any such degradation affects revenue, maintaining correct functionality is of paramount concern, and it is important to find the root cause of any anomaly as quickly as possible. This is challenging because each service exposes numerous metrics, or sensors, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers. This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked list of possible root causes for monitoring teams to investigate. MonitorRank takes as input the historical and current time-series metrics of each sensor, along with the call graph generated between sensors, and builds an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
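To make the idea of ranking candidates from metrics plus a call graph concrete, here is a minimal Python sketch. It is not the MonitorRank algorithm, whose details the abstract does not give. It assumes one plausible unsupervised scheme: a random walk with restart over the call graph, biased toward callees whose metric correlates with the metric of the service where the anomaly was observed. The function names, the `restart` parameter, the service names, and the sample data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_root_causes(call_graph, metrics, anomalous, restart=0.15, iters=100):
    """Rank services as root-cause candidates for an anomaly seen at `anomalous`.

    call_graph: dict mapping a caller to the list of services it calls.
    metrics:    dict mapping every service to its metric time series.
    Assumption (not from the paper): visit frequency of a similarity-biased
    random walk with restart is used as the ranking score.
    """
    nodes = list(metrics)
    # Similarity of each service's metric to the anomalous service's metric.
    sim = {v: abs(pearson(metrics[v], metrics[anomalous])) for v in nodes}
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            callees = call_graph.get(u, [])
            total = sum(sim[c] for c in callees)
            if total == 0:
                # Dead end: the walker jumps back to the anomalous service.
                nxt[anomalous] += score[u]
                continue
            nxt[anomalous] += restart * score[u]
            for c in callees:
                nxt[c] += (1 - restart) * score[u] * sim[c] / total
        score = nxt
    candidates = [v for v in nodes if v != anomalous]
    return sorted(candidates, key=lambda v: score[v], reverse=True)


# Hypothetical latency series: frontend, search, and db spike together; auth does not.
call_graph = {"frontend": ["auth", "search"], "search": ["db"]}
metrics = {
    "frontend": [10, 10, 11, 10, 30, 32, 31],
    "search": [5, 5, 6, 5, 20, 22, 21],
    "db": [2, 2, 2, 3, 15, 16, 15],
    "auth": [4, 5, 4, 5, 4, 5, 4],
}
ranking = rank_root_causes(call_graph, metrics, "frontend")
print(ranking)
```

In this toy data the `auth` service, whose metric does not track the frontend's, ends up last in the ranking. A plain forward walk like this one cannot by itself separate a faulty dependency from the services that merely relay its slowdown, so a production method would need additional structure beyond what is sketched here.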