Combining statistical monitoring and predictable recovery for self-management

Authors:
Armando Fox;Emre Kiciman;David Patterson
Affiliations:
Stanford University;Stanford University;University of California, Berkeley
Venue:
WOSS '04 Proceedings of the 1st ACM SIGSOFT workshop on Self-managed systems
Year:
2004

Abstract

Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications present the opportunity for a new approach to keeping them running in the face of many common recoverable failures. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system: what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges. Our hope is that this approach will enable "new science" in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory (SLT) techniques to problems of system dependability.
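
The idea of extracting a structural baseline from observation, rather than specifying it a priori, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of a request as a (request type, component list) pair, and the 0.05 threshold are all assumptions made for illustration. The baseline is simply the empirical frequency with which each set of components served each request type during normal operation, and a request is flagged when its component set was rarely or never seen.

```python
from collections import Counter, defaultdict


def learn_baseline(observed_paths):
    """Count how often each component set served each request type.

    observed_paths is an iterable of (request_type, components) pairs
    recorded during normal operation (a hypothetical trace format).
    """
    baseline = defaultdict(Counter)
    for request_type, components in observed_paths:
        baseline[request_type][frozenset(components)] += 1
    return baseline


def path_probability(baseline, request_type, components):
    """Empirical probability of this component set for this request type."""
    counts = baseline.get(request_type)
    if not counts:
        # Request type never observed during the learning period.
        return 0.0
    return counts[frozenset(components)] / sum(counts.values())


def is_anomalous(baseline, request_type, components, threshold=0.05):
    """Flag a request whose structural behavior is rare under the baseline.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return path_probability(baseline, request_type, components) < threshold


# Hypothetical observations of a three-tier application.
normal = (
    [("checkout", ["web", "cart", "db"])] * 95
    + [("checkout", ["web", "cart", "cache"])] * 5
)
baseline = learn_baseline(normal)

common_path_flagged = is_anomalous(baseline, "checkout", ["web", "cart", "db"])
unseen_path_flagged = is_anomalous(baseline, "checkout", ["web", "db"])
```

In this sketch the common path is accepted and the never-observed path is flagged. A real detector would also need to localize the fault to a component and tolerate legitimate workload shifts, which this example does not attempt.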