Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to "software aging" as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. "Software rejuvenation" is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be triggered by a variety of indicators of aging, or by the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.
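The idea of estimating the time remaining until a resource reaches a critical level can be illustrated with a minimal sketch. It assumes a simple least-squares linear trend over periodic usage samples, which is not necessarily the estimator used by the authors or by IBM Director. The names `time_to_exhaustion`, `should_rejuvenate`, `critical_level`, and `lead_time` are hypothetical and introduced only for illustration.

```python
from typing import Optional, Sequence


def time_to_exhaustion(
    times: Sequence[float],
    usage: Sequence[float],
    critical_level: float,
) -> Optional[float]:
    """Estimate the time left until `usage` reaches `critical_level`.

    Assumption: resource usage grows roughly linearly, so a least-squares
    line through the samples is extrapolated forward. Returns None when no
    upward trend is found (no aging detected by this simple test).
    """
    n = len(times)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two paired samples")

    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")

    # Least-squares slope and intercept of usage against time.
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # usage is flat or shrinking

    intercept = mean_u - slope * mean_t
    t_critical = (critical_level - intercept) / slope
    # Remaining time is measured from the most recent sample.
    return max(0.0, t_critical - times[-1])


def should_rejuvenate(remaining: Optional[float], lead_time: float) -> bool:
    """Trigger rejuvenation once the estimate falls within `lead_time`."""
    return remaining is not None and remaining <= lead_time


if __name__ == "__main__":
    # Hypothetical samples: hours since start, memory in use (MB).
    hours = [0, 1, 2, 3, 4]
    used_mb = [100, 110, 120, 130, 140]
    remaining = time_to_exhaustion(hours, used_mb, critical_level=200)
    print(remaining, should_rejuvenate(remaining, lead_time=8))
```

In this sketch a prediction-based policy rejuvenates only when the estimated remaining time drops below a chosen lead time, whereas a time-based policy would restart at a fixed interval regardless of the measurements.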