Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. The workload and internal architecture of three-tier enterprise applications present an opportunity for a new approach to keeping them running in the face of many common recoverable failures. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system: which parts of the system, in which combinations, are exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges. Our hope is that this approach will enable "new science" in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory (SLT) techniques to problems of system dependability.
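To make the idea of a structural baseline concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method used in the work itself: the abstract does not name a specific statistic, so a chi-square-style deviation score is swapped in, and the function names (learn_baseline, anomaly_scores), the floor parameter, and the component names (web, app, db, cache) are all hypothetical. Each request is assumed to be recorded as the ordered list of components it passed through.

```python
from collections import Counter


def edges(path):
    """Return the set of (caller, callee) pairs exercised by one request."""
    return set(zip(path, path[1:]))


def learn_baseline(paths):
    """Estimate, from normal operation, the fraction of requests using each edge."""
    counts = Counter()
    for path in paths:
        counts.update(edges(path))
    total = len(paths)
    return {edge: count / total for edge, count in counts.items()}


def anomaly_scores(baseline, paths, floor=1e-3):
    """Score each edge by a chi-square-style deviation from the baseline.

    `floor` (an assumed smoothing constant) keeps the expected count
    positive for edges that never appeared during baseline collection.
    """
    observed = Counter()
    for path in paths:
        observed.update(edges(path))
    total = len(paths)
    scores = {}
    for edge in set(baseline) | set(observed):
        expected = max(baseline.get(edge, 0.0), floor) * total
        scores[edge] = (observed.get(edge, 0) - expected) ** 2 / expected
    return scores


# Baseline: half of the requests reach the database, half reach the cache.
normal = [["web", "app", "db"]] * 50 + [["web", "app", "cache"]] * 50
baseline = learn_baseline(normal)

# Observation window: requests that should reach the database stop at "app".
window = [["web", "app"]] * 10 + [["web", "app", "cache"]] * 10
scores = anomaly_scores(baseline, window)

# The edge with the largest deviation is the localization candidate.
suspect = max(scores, key=scores.get)
print(suspect, scores[suspect])
```

In this toy window the mean request count looks unremarkable, yet the edge from app to db has vanished, which is the kind of structural deviation the abstract contrasts with operational statistics such as mean response time. A real deployment would need a principled threshold and enough traffic per window for the counts to be meaningful.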