The resiliency challenge presented by soft failure incidents

  • Author: J. M. Caffrey
  • Affiliation: IBM Systems and Technology Group, Poughkeepsie, NY
  • Venue: IBM Systems Journal
  • Year: 2008

Abstract

A common problem observed on mainframe installations, and one that presents a significant challenge for resiliency and high availability, involves soft failure incidents. In contrast to catastrophic failures, soft failures involve some degree of system degradation without an obvious cause. This has been described with the phrase: "Systems don't break; they just stop running, and we don't know why." Extending a medical paradigm, this paper proposes a new method for solutions deployed on IBM z/OS™ systems to respond when either the system or the application stops running. The current approach is to treat the "disease" by determining the cause of the problem and taking action to prevent its recurrence. The new approach is to determine whether the system or application is behaving abnormally, identify the cause of this abnormal behavior, and take action to treat the "symptom." This new approach uses machine learning and mathematical modeling to identify normal behavior, enabling the detection of abnormal behavior before it affects the customer. Based on an analysis of critical problems and preliminary modeling work, the types of abnormal behavior identified are assigned to broad categories. In this paper, we describe the progress being made to address the challenge of soft failures by implementing this new paradigm.