Several recent studies have reported the phenomenon of "software aging", in which the state of a software system degrades with time. This may eventually lead to performance degradation of the software, to crash/hang failure, or to both. "Software rejuvenation" is a proactive technique aimed at preventing unexpected or unplanned outages due to aging. The basic idea is to stop the running software, clean its internal state, and restart it. In this paper, we discuss software rejuvenation as applied to cluster systems. This is both an innovative and an efficient way to improve cluster system availability and productivity. Using Stochastic Reward Nets (SRNs), we model and analyze cluster systems that employ software rejuvenation. For our proposed time-based rejuvenation policy, we determine the optimal rejuvenation interval based on system availability and cost. We also introduce a new rejuvenation policy based on prediction and show that it can dramatically increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behaviors, and performability measures, which we are just beginning to explore. We then briefly describe an implementation of a software rejuvenation system that performs periodic and predictive rejuvenation, and show some empirical data from systems that exhibit aging.
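The trade-off behind a time-based rejuvenation interval can be illustrated with a minimal sketch. This is not the SRN model of the paper. It is a classical single-node age-replacement model, assumed here for illustration only: time to failure follows a Weibull distribution whose shape parameter above 1 stands in for aging, a failure costs a long repair, and a planned rejuvenation costs a short restart. All function names and parameter values below are hypothetical.

```python
import math


def reliability(t, scale, shape):
    """Probability that an aging system (Weibull, assumed) survives to time t."""
    return math.exp(-((t / scale) ** shape))


def expected_uptime(interval, scale, shape, steps=500):
    """Expected uptime per cycle: integral of reliability over [0, interval],
    approximated with the trapezoidal rule."""
    h = interval / steps
    total = 0.5 * (reliability(0.0, scale, shape) + reliability(interval, scale, shape))
    for i in range(1, steps):
        total += reliability(i * h, scale, shape)
    return total * h


def availability(interval, scale, shape, repair_time, rejuvenation_time):
    """Steady-state availability when the system is rejuvenated every `interval`
    time units unless it fails first (age-replacement policy)."""
    survive = reliability(interval, scale, shape)
    up = expected_uptime(interval, scale, shape)
    # A cycle ends either in a failure (long repair) or a planned restart (short).
    down = repair_time * (1.0 - survive) + rejuvenation_time * survive
    return up / (up + down)


def best_interval(candidates, **params):
    """Grid search for the candidate interval with the highest availability."""
    return max(candidates, key=lambda t: availability(t, **params))


if __name__ == "__main__":
    # Assumed values: time unit is hours, shape > 1 models aging.
    params = dict(scale=100.0, shape=2.0, repair_time=5.0, rejuvenation_time=0.5)
    candidates = [float(t) for t in range(1, 501)]
    t_star = best_interval(candidates, **params)
    print("best interval:", t_star)
    print("availability at best interval:", availability(t_star, **params))
    print("availability with rare rejuvenation:", availability(500.0, **params))
```

Under these assumptions, rejuvenating too often wastes time on restarts and rejuvenating too rarely lets failures dominate, so the search settles on an interior interval. A cost-based variant would weight the two downtime terms by their respective costs instead of their durations.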