Analysis and implementation of software rejuvenation in cluster systems

Authors:
Kalyanaraman Vaidyanathan;Richard E. Harper;Steven W. Hunter;Kishor S. Trivedi
Affiliations:
Dept. of ECE, Duke University, Durham, NC;IBM Research, Raleigh, NC;IBM Corporation, RTP, NC;Dept. of ECE, Duke University, Durham, NC
Venue:
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Year:
2001

Abstract

Several recent studies have reported the phenomenon of "software aging", in which the state of a software system degrades with time. This may eventually lead to performance degradation of the software, a crash/hang failure, or both. "Software rejuvenation" is a proactive technique aimed at preventing unexpected or unplanned outages due to aging. The basic idea is to stop the running software, clean its internal state, and restart it. In this paper, we discuss software rejuvenation as applied to cluster systems. This is both an innovative and an efficient way to improve cluster system availability and productivity. Using Stochastic Reward Nets (SRNs), we model and analyze cluster systems that employ software rejuvenation. For our proposed time-based rejuvenation policy, we determine the optimal rejuvenation interval based on system availability and cost. We also introduce a new rejuvenation policy based on prediction and show that it can dramatically increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore. We then briefly describe an implementation of a software rejuvenation system that performs periodic and predictive rejuvenation, and show some empirical data from systems that exhibit aging.
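
To make the idea of an optimal time-based rejuvenation interval concrete, here is a minimal sketch in Python. It is not the SRN model from the paper. It assumes a much simpler single-node renewal model in which time to failure follows a Weibull distribution with shape greater than 1 (a common stand-in for aging), a failure costs `repair_time` hours of downtime, and a planned rejuvenation costs a shorter `rejuv_time`. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def survival(t, scale, shape):
    """Weibull survival function R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def availability(delta, scale, shape, repair_time, rejuv_time, steps=2000):
    """Steady-state availability when rejuvenating every `delta` hours.

    One renewal cycle ends either with a failure before `delta`
    (followed by a repair) or with a planned rejuvenation at `delta`.
    """
    # Expected uptime per cycle: integral of R(t) over [0, delta],
    # approximated with the trapezoid rule.
    h = delta / steps
    up = 0.5 * (survival(0.0, scale, shape) + survival(delta, scale, shape))
    up += sum(survival(i * h, scale, shape) for i in range(1, steps))
    up *= h

    # Expected downtime per cycle: repair if the node failed first,
    # otherwise the (shorter) planned rejuvenation.
    r = survival(delta, scale, shape)
    down = (1.0 - r) * repair_time + r * rejuv_time
    return up / (up + down)


def best_interval(candidates, **params):
    """Return the candidate interval with the highest availability."""
    return max(candidates, key=lambda d: availability(d, **params))


# Hypothetical parameters (hours); not taken from the paper.
params = dict(scale=240.0, shape=2.5, repair_time=4.0, rejuv_time=0.2)
candidates = [float(d) for d in range(10, 1001, 10)]
delta_star = best_interval(candidates, **params)

print("best interval (h):", delta_star)
print("availability at best interval:", availability(delta_star, **params))
print("availability with almost no rejuvenation:",
      availability(1000.0, **params))
```

Under these assumptions, rejuvenating too often wastes time on planned restarts, while rejuvenating too rarely lets most cycles end in a costly failure, so the best interval lies in between. A cost-based variant would weight the two downtime terms by different hourly costs instead of treating them equally.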