Investigation of failure causes in workload-driven reliability testing

  • Authors:
  • Domenico Cotroneo; Roberto Pietrantuono; Leonardo Mariani; Fabrizio Pastore

  • Affiliations:
  • Università degli Studi di Napoli Federico II, Naples, Italy; Università degli Studi di Napoli Federico II, Naples, Italy; Università degli Studi di Milano Bicocca, Milano; Università degli Studi di Milano Bicocca, Milano

  • Venue:
  • Fourth international workshop on Software quality assurance: in conjunction with the 6th ESEC/FSE joint meeting
  • Year:
  • 2007

Abstract

Virtual execution environments and middleware are required to be extremely reliable because applications running on top of them are developed assuming their correctness, and platform-level failures can result in serious and unexpected application-level problems. Since software platforms and middleware are often executed for long periods without interruption, a large part of the testing process is devoted to investigating their behavior under long and stressful executions (these test cases are called workloads). When a problem is identified, software engineers examine log files to find its root cause. Unfortunately, because of the length of the workloads, log files can contain a huge amount of information, and manual analysis is often prohibitive. Thus, de facto, the identification of the root cause is mostly left to the intuition of the software engineer. In this paper, we propose a technique to automatically analyze logs obtained from workloads to retrieve important information that can relate the failure to its cause. The technique works in three steps: (1) during workload executions, the system under test is monitored; (2) logs extracted from workloads that have completed successfully are used to derive compact and general models of the expected behavior of the target system; (3) logs corresponding to workloads that terminated unsuccessfully are compared with the inferred models to identify anomalous event sequences. Anomalies help software engineers to identify failure causes. The technique can also be used during the operational phase, to discover possible causes of unexpected failures by comparing logs corresponding to failing executions with models derived at testing time. Preliminary experiments conducted on the Java Virtual Machine indicate that several bugs can be rapidly identified thanks to the feedback provided by our technique.
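
The three steps can be illustrated with a minimal sketch. The abstract does not say which model-inference technique the authors use, so this example substitutes a deliberately simple stand-in: a model that records which event-to-event transitions occur in logs of successful workloads, and flags any transition in a failing log that was never observed. The function names (`learn_model`, `find_anomalies`) and the event names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Sentinel events marking the beginning and end of a log (assumed for this sketch).
START, END = "<start>", "<end>"


def learn_model(passing_logs):
    """Step 2: record every event-to-event transition seen in passing logs."""
    allowed = defaultdict(set)
    for log in passing_logs:
        prev = START
        for event in log:
            allowed[prev].add(event)
            prev = event
        allowed[prev].add(END)
    return allowed


def find_anomalies(model, failing_log):
    """Step 3: return (position, previous event, event) for each unseen transition."""
    anomalies = []
    prev = START
    for pos, event in enumerate(failing_log):
        if event not in model.get(prev, set()):
            anomalies.append((pos, prev, event))
        prev = event
    # A log that stops after an event that never ended a passing log is also anomalous.
    if END not in model.get(prev, set()):
        anomalies.append((len(failing_log), prev, END))
    return anomalies


# Step 1 (monitoring) is assumed to have produced these event sequences.
passing = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
]
failing = ["open", "close", "read"]

print(find_anomalies(learn_model(passing), failing))
# → [(1, 'open', 'close'), (2, 'close', 'read'), (3, 'read', '<end>')]
```

A transition-set model of this kind is compact but coarse: it generalizes across passing runs, yet it cannot capture constraints that span more than two consecutive events, which a richer inferred model could.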