Diagnosis of recurrent faults using log files

  • Authors:
  • Thomas Reidemeister;Mohammad Ahmad Munawar;Miao Jiang;Paul A. S. Ward

  • Affiliations:
  • University of Waterloo, Ontario, Canada;University of Waterloo, Ontario, Canada;University of Waterloo, Ontario, Canada;University of Waterloo, Ontario, Canada

  • Venue:
  • CASCON '09 Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research
  • Year:
  • 2009

Abstract

Enterprise software systems (ESS) are becoming larger and more complex. Failure in business-critical systems is expensive, with consequences such as loss of critical data, loss of sales, customer dissatisfaction, and even lawsuits. Detecting failures and diagnosing their root cause in a timely manner is therefore essential. Many studies suggest that a large fraction of failures encountered in practice are recurrent (i.e., they have been seen before). Fast and accurate detection of these failures can accelerate problem determination and thereby improve system reliability. To this end, we explore machine learning techniques, including the Naïve Bayes classifier, partially-supervised learning, and decision trees (using C4.5), to automatically recognize symptoms of recurrent faults and to derive detection rules from samples of log data. This work focuses on log files because they are readily available and impose no additional computational burden on the component generating the data. The methods explored in this work can aid the development of tools that assist support personnel in problem determination tasks. Instead of requiring operators to manually define patterns for identifying recurrent problems, such tools can be trained using prior solved and unsolved cases from existing support databases.
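To make the idea of recognizing recurrent faults from log samples concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing applied to log lines. It is not the authors' implementation: the whitespace tokenizer, the class name `NaiveBayesLogClassifier`, the fault labels, and the training lines are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(line):
    # Assumption: lowercase whitespace tokens; real log parsing would be richer.
    return line.lower().split()


class NaiveBayesLogClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        # samples: iterable of (log_text, fault_label) pairs from solved cases
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in samples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-made training samples (not data from the paper).
TRAINING = [
    ("ERROR connection pool exhausted timeout waiting for connection",
     "db_pool_exhaustion"),
    ("WARN connection pool size at maximum", "db_pool_exhaustion"),
    ("ERROR java.lang.OutOfMemoryError heap space", "memory_leak"),
    ("WARN gc overhead limit exceeded heap usage high", "memory_leak"),
]

clf = NaiveBayesLogClassifier().fit(TRAINING)
print(clf.predict("ERROR timeout waiting for connection from pool"))
```

In a support setting, the labeled pairs would presumably come from previously resolved cases, so that a new log excerpt can be matched against known faults before an operator investigates from scratch.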