Proactive management of software aging

Authors:
V. Castelli;R. E. Harper;P. Heidelberger;S. W. Hunter;K. S. Trivedi;K. Vaidyanathan;W. P. Zeggert
Affiliations:
IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Server Group, Research Triangle Park, North Carolina;Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina;Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina;IBM Server Group, Research Triangle Park, North Carolina
Venue:
IBM Journal of Research and Development
Year:
2001

Abstract

Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to "software aging" as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. "Software rejuvenation" is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be triggered either by a variety of indicators of aging or by the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.
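
The prediction step described in the abstract (detect a downward trend in a resource, then estimate the time left before it reaches a critical level) can be illustrated with a minimal sketch. The abstract does not say which estimator is used, so the ordinary least-squares line fit below is an assumption, and the names `estimate_time_to_exhaustion`, `should_rejuvenate`, `critical_level`, and `horizon` are hypothetical rather than part of IBM Director.

```python
from typing import Optional, Sequence


def estimate_time_to_exhaustion(
    times: Sequence[float],
    free_resource: Sequence[float],
    critical_level: float = 0.0,
) -> Optional[float]:
    """Extrapolate a least-squares line through the samples to critical_level.

    Returns the estimated time remaining after the last sample, or None when
    the fitted slope is not negative (no depletion trend observed).
    Assumption: depletion is roughly linear over the sampled window.
    """
    n = len(times)
    if n < 2 or n != len(free_resource):
        raise ValueError("need at least two paired samples")

    mean_t = sum(times) / n
    mean_y = sum(free_resource) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, free_resource))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    if slope >= 0:
        return None  # resource is flat or growing: no exhaustion predicted

    # Time at which the fitted line crosses the critical level.
    t_critical = (critical_level - intercept) / slope
    return max(0.0, t_critical - times[-1])


def should_rejuvenate(time_remaining: Optional[float], horizon: float) -> bool:
    """Hypothetical policy: rejuvenate when exhaustion is predicted within horizon."""
    return time_remaining is not None and time_remaining <= horizon


if __name__ == "__main__":
    # Illustrative samples only: free memory (MB) dropping 10 MB per hour.
    hours = [0, 1, 2, 3, 4]
    free_mb = [100, 90, 80, 70, 60]
    remaining = estimate_time_to_exhaustion(hours, free_mb)
    print(remaining, should_rejuvenate(remaining, horizon=8))
```

With the illustrative samples, the fitted line reaches zero at hour 10, so about 6 hours remain after the last sample. A production monitor would also need a statistical test that the trend is significant before acting on the estimate.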