Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of "software aging" [1], in which the state of a software system degrades with time, has been reported. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and the accumulation of numerical errors. Aging may eventually lead to performance degradation of the software, to crash/hang failures, or to both. Earlier work on detecting aging and estimating its effect on system resources does not take the system workload into account [2]. In this paper, we propose a measurement-based model that estimates the time to exhaustion of operating system resources as a function of both time and the system workload state. We construct a semi-Markov reward model from workload and resource usage data collected from the UNIX operating system. We first identify the different workload states using statistical cluster analysis and build a state-space model. For each resource, we then define a reward function on the model based on the rate of resource exhaustion in each state. Solving the model yields resource usage trends and the estimated exhaustion rate for each resource. With these estimates, proactive fault management techniques such as "software rejuvenation" [1] may be employed to prevent unexpected outages.
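To make the modeling steps above concrete, the following is a minimal sketch of a steady-state semi-Markov reward calculation. It is not the paper's implementation: it assumes the workload states have already been identified (for example by clustering), and the transition matrix `P`, mean sojourn times `h`, per-state reward rates `r`, and the free-resource figure are all hypothetical values chosen for illustration. The sketch weights each state's resource consumption rate by the long-run fraction of time spent in that state, then divides the remaining resource by the resulting overall rate.

```python
def embedded_stationary(P, iters=100_000, tol=1e-13):
    """Stationary distribution nu of the embedded jump chain (nu = nu P).

    Uses a 'lazy' power iteration (average of v and vP), which has the
    same fixed point as P but also converges for periodic chains.
    """
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iters):
        vP = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        nxt = [0.5 * a + 0.5 * b for a, b in zip(v, vP)]
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            v = nxt
            break
        v = nxt
    total = sum(v)
    return [x / total for x in v]


def time_fractions(P, h):
    """Long-run fraction of time in each state: pi_i ~ nu_i * h_i."""
    nu = embedded_stationary(P)
    weighted = [nu_i * h_i for nu_i, h_i in zip(nu, h)]
    total = sum(weighted)
    return [w / total for w in weighted]


def exhaustion_rate(P, h, r):
    """Expected steady-state reward rate: sum_i pi_i * r_i."""
    pi = time_fractions(P, h)
    return sum(p * r_i for p, r_i in zip(pi, r))


def time_to_exhaustion(free_amount, rate):
    """Estimated time until the resource runs out at a constant rate."""
    if rate <= 0.0:
        return float("inf")  # resource is not being depleted on average
    return free_amount / rate


if __name__ == "__main__":
    # Hypothetical three-state workload model (all numbers are made up).
    P = [[0.0, 0.7, 0.3],   # transition probabilities between states
         [0.6, 0.0, 0.4],
         [0.5, 0.5, 0.0]]
    h = [300.0, 120.0, 60.0]  # mean sojourn time per state, seconds
    r = [0.5, 2.0, 6.0]       # memory consumed per second in each state, KB/s

    rate = exhaustion_rate(P, h, r)
    print("overall exhaustion rate (KB/s):", rate)
    print("estimated time to exhaustion (s):", time_to_exhaustion(512_000.0, rate))
```

The steady-state weighting is the simplest way to combine per-state rates and ignores transient behavior and confidence intervals; a fuller treatment would estimate the transition probabilities, sojourn time distributions, and reward rates from the collected measurements.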