Performance problem localization in self-healing, service-oriented systems using Bayesian networks

  • Authors: Rui Zhang; Steve Moyle; Steve McKeever; Alan Bivens

  • Affiliations: Oxford University, Oxford, England; Oxford University, Oxford, England; Oxford University, Oxford, England; IBM T.J. Watson Research Center, Hawthorne, N.Y.

  • Venue: Proceedings of the 2007 ACM Symposium on Applied Computing
  • Year: 2007

Abstract

In distributed, service-oriented environments, performance problem localization is required to provide self-healing capabilities and deliver the desired quality of service (QoS). This paper presents an automated approach to identifying system elements causing performance problems. Applying probabilistic inference to collected response time and elapsed time data, the approach 1) infers elapsed time for services where data is missing, 2) estimates the response time degradation caused by different services using the duration, abnormality and response time correlation of their elapsed times, and 3) identifies the services that are the most important causes of slow response time and yield the most benefit if recovered. The approach has been used to localize a performance problem on the test bed of a real-world service-oriented Grid. Evaluation using simulations shows that the approach consistently achieves better accuracy than traditional techniques in various service-oriented settings.
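The three signals named in the abstract (duration, abnormality, and response time correlation of a service's elapsed times) can be illustrated with a small ranking sketch. This is not the paper's Bayesian network; it is a simple heuristic stand-in, and the function names, the multiplicative score, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length samples (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_services(elapsed, baseline, response):
    """Rank services by a hypothetical score: duration * abnormality * correlation.

    elapsed  -- observed elapsed times per service (aligned with `response`)
    baseline -- elapsed times per service recorded under normal conditions
    response -- observed end-to-end response times
    """
    scores = {}
    for name, samples in elapsed.items():
        # Duration: share of the mean response time spent in this service.
        duration = mean(samples) / mean(response)
        # Abnormality: how far the observed mean sits above the baseline mean,
        # in baseline standard deviations (clamped at zero).
        base_sd = pstdev(baseline[name]) or 1e-9
        abnormality = max(0.0, (mean(samples) - mean(baseline[name])) / base_sd)
        # Correlation: how closely elapsed time tracks response time.
        correlation = max(0.0, pearson(samples, response))
        scores[name] = duration * abnormality * correlation
    return sorted(scores, key=scores.get, reverse=True)


# Made-up measurements in milliseconds: only "db" has slowed down.
baseline = {"auth": [10, 11, 9, 10], "db": [20, 21, 19, 20], "render": [5, 5, 6, 4]}
observed = {"auth": [10, 11, 9, 10], "db": [40, 60, 50, 80], "render": [5, 6, 4, 5]}
response = [sum(vals) for vals in zip(*observed.values())]

ranking = rank_services(observed, baseline, response)
print(ranking[0])
```

In this toy data the service whose elapsed time is long, unusually high relative to its baseline, and closely tracking the response time is ranked first. The approach described in the abstract additionally infers elapsed times that are missing, which this sketch does not attempt.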