Performance problem localization in self-healing, service-oriented systems using Bayesian networks

Authors:
Rui Zhang;Steve Moyle;Steve McKeever;Alan Bivens
Affiliations:
Oxford University, Oxford, England;Oxford University, Oxford, England;Oxford University, Oxford, England;IBM T.J. Watson Research Center, Hawthorne, N.Y.
Venue:
Proceedings of the 2007 ACM symposium on Applied computing
Year:
2007


Abstract

In distributed, service-oriented environments, performance problem localization is required to provide self-healing capabilities and deliver the desired quality of service (QoS). This paper presents an automated approach to identifying system elements causing performance problems. Applying probabilistic inference to collected response time and elapsed time data, the approach 1) infers elapsed time for services where data is missing, 2) estimates the response time degradation caused by different services using the duration, abnormality and response time correlation of their elapsed times, and 3) identifies the services that are the most important causes of slow response time and yield the most benefit if recovered. The approach has been used to localize a performance problem on the test bed of a real-world service-oriented Grid. Evaluation using simulations shows that the approach consistently achieves better accuracy than traditional techniques in various service-oriented settings.
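The three steps in the abstract can be illustrated with a rough sketch. This is not the paper's Bayesian network model. It assumes a much simpler stand-in: missing elapsed times are filled with the conditional mean of a two-variable linear-Gaussian fit against response time, and each service is scored by how far its mean elapsed time exceeds a baseline, weighted by its correlation with response time. The function names, the `baseline` values, the scoring formula, and the sample data are all hypothetical.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def infer_missing(elapsed, response):
    """Fill None entries with E[elapsed | response] from a linear fit
    on the observed (elapsed, response) pairs (assumed stand-in for
    probabilistic inference over a full network)."""
    pairs = [(e, r) for e, r in zip(elapsed, response) if e is not None]
    es = [e for e, _ in pairs]
    rs = [r for _, r in pairs]
    me, mr = mean(es), mean(rs)
    var_r = sum((r - mr) ** 2 for r in rs)
    slope = 0.0 if var_r == 0 else sum((e - me) * (r - mr) for e, r in pairs) / var_r
    return [e if e is not None else me + slope * (r - mr)
            for e, r in zip(elapsed, response)]


def rank_services(elapsed_by_service, response, baseline):
    """Rank services by a hypothetical score: excess elapsed time over
    the baseline, weighted by positive correlation with response time."""
    scores = {}
    for name, elapsed in elapsed_by_service.items():
        filled = infer_missing(elapsed, response)
        excess = max(0.0, mean(filled) - baseline[name])
        corr = max(0.0, pearson(filled, response))
        scores[name] = excess * corr
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Made-up measurements in seconds; None marks a missing sample.
    response = [1.0, 1.2, 3.0, 3.4, 1.1]
    elapsed = {
        "auth": [0.2, 0.2, 0.2, 0.2, 0.2],
        "db": [0.5, 0.6, None, 2.8, 0.5],
        "cache": [0.3, 0.4, 0.3, 0.4, 0.3],
    }
    baseline = {"auth": 0.2, "db": 0.5, "cache": 0.3}
    print(rank_services(elapsed, response, baseline))
```

In this toy data the `db` service both runs well above its baseline and tracks the slow responses, so it ranks first. A full Bayesian network would instead model dependencies among all services jointly, which this per-service fit ignores.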