Proactive management of software aging

  • Authors:
  • V. Castelli; R. E. Harper; P. Heidelberger; S. W. Hunter; K. S. Trivedi; K. Vaidyanathan; W. P. Zeggert

  • Affiliations:
  • IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York; IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York; IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York; IBM Server Group, Research Triangle Park, North Carolina; Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina; Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina; IBM Server Group, Research Triangle Park, North Carolina

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2001

Abstract

Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to "software aging" as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. "Software rejuvenation" is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be performed by relying on a variety of indicators of aging, or on the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.
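
The prediction-based idea described in the abstract (detect a resource-exhaustion trend, estimate the time remaining until a critical level, and rejuvenate before it is reached) can be illustrated with a minimal sketch. This is not the method used in the paper or in IBM Director; it assumes a plain least-squares line fit over sampled resource usage, and the names `time_to_exhaustion`, `should_rejuvenate`, `limit`, and `horizon` are hypothetical.

```python
from typing import Optional, Sequence


def time_to_exhaustion(
    times: Sequence[float], usage: Sequence[float], limit: float
) -> Optional[float]:
    """Estimate the time from the last sample until usage reaches `limit`.

    Assumption (illustrative only): usage grows roughly linearly, so an
    ordinary least-squares line is extrapolated to the critical level.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(times)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no upward trend, so no predicted exhaustion
    intercept = mean_u - slope * mean_t
    t_hit = (limit - intercept) / slope  # time at which the line meets limit
    return max(0.0, t_hit - times[-1])


def should_rejuvenate(remaining: Optional[float], horizon: float) -> bool:
    """Hypothetical policy: rejuvenate once predicted exhaustion is within `horizon`."""
    return remaining is not None and remaining <= horizon


if __name__ == "__main__":
    # Hypothetical samples: hours since start, and percent of a resource in use.
    hours = [0.0, 1.0, 2.0, 3.0]
    used = [10.0, 20.0, 30.0, 40.0]
    remaining = time_to_exhaustion(hours, used, limit=100.0)
    print(remaining, should_rejuvenate(remaining, horizon=8.0))
```

A real monitor would need a trend test that tolerates noise and seasonality before acting on an estimate; the straight-line fit here only shows the shape of the computation.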