Pinpointing the Subsystems Responsible for the Performance Deviations in a Load Test

  • Authors:
  • Haroon Malik; Bram Adams; Ahmed E. Hassan

  • Venue:
  • ISSRE '10 Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering
  • Year:
  • 2010

Abstract

Large-scale systems (LSS) contain multiple subsystems that interact across multiple nodes in sometimes unforeseen and complicated ways. As a result, pinpointing the subsystems that are the source of performance degradation in a load test of an LSS can be frustrating and might take several hours or even days. This is due to the large volume of performance counter data collected (such as CPU utilization, disk I/O, memory consumption, and network traffic), the limited operational knowledge that analysts have of all the subsystems of an LSS, and the unavailability of up-to-date documentation for an LSS. We have developed a methodology that automatically ranks the subsystems according to the deviation of their performance in a load test. Our methodology uses the performance counter data of a load test to craft performance signatures for the LSS subsystems. Pair-wise correlations among the performance signatures of subsystems within a load test are compared with the corresponding correlations in a baseline test to pinpoint the subsystems responsible for the performance violations. Case studies on load test data obtained from a large telecom system and from an open source benchmark application show that our approach provides an accuracy of 79% and does not require any instrumentation or domain knowledge to operate.
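
The ranking idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how a performance signature is built or how correlation differences are scored, so the sketch assumes that a signature is the per-sample mean of z-scored counters, that Pearson correlation is used, and that a subsystem's deviation is the mean absolute change in its correlations with every other subsystem. All function names, subsystem names, and counter values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore(series):
    """Standardize one counter series; a constant series maps to zeros."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def signature(counters):
    """Assumed signature: per-sample mean of the z-scored counters."""
    standardized = [zscore(s) for s in counters.values()]
    return [mean(column) for column in zip(*standardized)]


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either series is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def correlations(test):
    """Pair-wise correlations among the subsystem signatures of one test."""
    sigs = {name: signature(counters) for name, counters in test.items()}
    return {(a, b): pearson(sigs[a], sigs[b])
            for a in sigs for b in sigs if a != b}


def rank_subsystems(baseline, load_test):
    """Rank subsystems by how much their correlations moved from baseline."""
    base, cur = correlations(baseline), correlations(load_test)
    names = list(baseline)
    scores = {
        n: mean([abs(cur[(n, m)] - base[(n, m)]) for m in names if m != n])
        for n in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counter samples: in the baseline all subsystems ramp up
# together; in the load test the "db" subsystem moves the opposite way.
ramp = [1, 2, 3, 4, 5, 6]
baseline = {
    "web": {"cpu": ramp},
    "app": {"cpu": [2 * x for x in ramp]},
    "db": {"cpu": [1, 3, 5, 7, 9, 11], "disk_io": [10 * x for x in ramp]},
}
load_test = {
    "web": {"cpu": ramp},
    "app": {"cpu": [2 * x for x in ramp]},
    "db": {"cpu": [11, 9, 7, 5, 3, 1], "disk_io": [60, 50, 40, 30, 20, 10]},
}

ranking = rank_subsystems(baseline, load_test)
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy data the "db" subsystem is listed first, because both of its correlations flip relative to the baseline while "web" and "app" each see only one of their two correlations change. Real counter data would need time alignment and noise handling that the sketch leaves out.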