A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems

Authors:
Kalyanaraman Vaidyanathan;Kishor S. Trivedi
Affiliations:
-;-
Venue:
ISSRE '99 Proceedings of the 10th International Symposium on Software Reliability Engineering
Year:
1999

Citing 14
Cited 24

Performance Modeling Based on Real Data: A Case Study

IEEE Transactions on Computers - Fault-Tolerant Computing
Predictability of Process Resource Usage: A Measurement-Based Study on UNIX

IEEE Transactions on Software Engineering
Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data

IEEE Transactions on Computers
Performability Analysis Using Semi-Markov Reward Processes

IEEE Transactions on Computers
High-Availability Computer Systems

Computer
Two techniques for transient software error recovery

Papers of the workshop on Hardware and software architectures for fault tolerance: experiences and perspectives
Performance and reliability analysis of computer systems: an example-based approach using the SHARPE software package

Performance and reliability analysis of computer systems: an example-based approach using the SHARPE software package
Clustering Algorithms

Clustering Algorithms
Dependability Measurement and Modeling of a Multicomputer System

IEEE Transactions on Computers
Analyze-NOW-an environment for collection and analysis of failures in a network of workstations

ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering
A Methodology for Detection and Estimation of Software Aging

ISSRE '98 Proceedings of the Ninth International Symposium on Software Reliability Engineering
Software Rejuvenation: Analysis, Module and Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Measurement of Failure Rate in Widely Distributed Software

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Effect of System Workload on Operating System Reliability: A Study on IBM 3081

IEEE Transactions on Software Engineering

Analysis and implementation of software rejuvenation in cluster systems

Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Improving cluster availability using workstation validation

SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Software Reliability and Rejuvenation: Modeling and Analysis

Performance Evaluation of Complex Systems: Techniques and Tools, Performance 2002, Tutorial Lectures
Model-Based Evaluation: From Dependability to Security

IEEE Transactions on Dependable and Secure Computing
A Comprehensive Model for Software Rejuvenation

IEEE Transactions on Dependable and Secure Computing
Performability analysis of clustered systems with rejuvenation under varying workload

Performance Evaluation
Modeling and analysis of software aging and software failure

Journal of Systems and Software
Analysis and optimization of service availability in a HA cluster with load-dependent machine availability

IEEE Transactions on Parallel and Distributed Systems
Estimating Periodic Software Rejuvenation Schedules under Discrete-Time Operation Circumstance

IEICE - Transactions on Information and Systems
Availability analysis of application servers using software rejuvenation and virtualization

Journal of Computer Science and Technology
Proactive management of software aging

IBM Journal of Research and Development
A survey of online failure prediction methods

ACM Computing Surveys (CSUR)
Managing performance of aging applications via synchronized replica rejuvenation

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
A study of dynamic meta-learning for failure prediction in large-scale systems

Journal of Parallel and Distributed Computing
Memory leak analysis of mission-critical middleware

Journal of Systems and Software
Towards IT systems capable of managing their health

FOCS'10 Proceedings of the 16th Monterey conference on Foundations of computer software: modeling, development, and verification of adaptive systems
Prediction-Based software availability enhancement

Self-star Properties in Complex Information Systems
Software rejuvenation in the cloud

Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques
A proactive approach towards always-on availability in broadband cable networks

Computer Communications
Predicting aging-related bugs using software complexity metrics

Performance Evaluation
Dynamic software rejuvenation policies in a transaction-based system under Markovian arrival processes

Performance Evaluation
Modeling and analysis of software rejuvenation in a server virtualized system with live VM migration

Performance Evaluation
A survey of software aging and rejuvenation studies

ACM Journal on Emerging Technologies in Computing Systems (JETC) - Special Issue on Reliability and Device Degradation in Emerging Technologies and Special Issue on WoSAR 2011
Workload-aware anomaly detection for Web applications

Journal of Systems and Software

Abstract

Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of "software aging" [1], in which the state of a software system degrades with time, has been reported. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. This may eventually lead to performance degradation of the software, a crash or hang failure, or both. Earlier work in this area to detect aging and to estimate its effect on system resources does not take into account the system workload [2]. In this paper, we propose a measurement-based model to estimate the time to exhaustion of operating system resources as a function of both time and the system workload state. The semi-Markov reward model is constructed based on workload and resource usage data collected from the UNIX operating system. We first identify different workload states using statistical cluster analysis and build a state-space model. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource exhaustion in the different states. The model is then solved to obtain trends and the estimated exhaustion rates for the resources. With the help of this measure, proactive fault management techniques such as "software rejuvenation" [1] may be employed to prevent unexpected outages.
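As a rough illustration of the modeling step described in the abstract, the following Python sketch computes a long-run resource exhaustion rate from a semi-Markov reward model and turns it into a time-to-exhaustion estimate. It is not the authors' implementation. It assumes the workload states have already been found by clustering, and it uses the standard steady-state result for semi-Markov reward processes: the fraction of time in state i is proportional to pi_i * h_i, where pi is the stationary distribution of the embedded Markov chain and h_i is the mean sojourn time. The function names, transition matrix, sojourn times, reward rates, and free-resource figure are all hypothetical values chosen for illustration.

```python
def stationary_distribution(P, max_iter=100_000, tol=1e-13):
    """Stationary distribution of an irreducible embedded Markov chain.

    Iterates the "lazy" chain 0.5*I + 0.5*P, which has the same stationary
    distribution as P but also converges when P is periodic (common for
    embedded chains, which have no self-transitions).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        stepped = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        nxt = [0.5 * pi[j] + 0.5 * stepped[j] for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def expected_reward_rate(P, sojourn, reward):
    """Long-run reward rate: sum(pi_i * h_i * r_i) / sum(pi_i * h_i)."""
    pi = stationary_distribution(P)
    weights = [p * h for p, h in zip(pi, sojourn)]
    return sum(w * r for w, r in zip(weights, reward)) / sum(weights)


def time_to_exhaustion(free_amount, rate):
    """Time until the resource runs out at a constant depletion rate."""
    return float("inf") if rate <= 0 else free_amount / rate


# Hypothetical three-state workload model (all numbers are made up).
P = [[0.0, 0.5, 0.5],      # embedded transition probabilities
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
SOJOURN = [10.0, 20.0, 30.0]   # mean time spent per visit to each state
REWARD = [1.0, 2.0, 3.0]       # resource depletion rate in each state

RATE = expected_reward_rate(P, SOJOURN, REWARD)
ESTIMATE = time_to_exhaustion(700.0, RATE)   # 700 units currently free
```

In this sketch the reward rates stand in for the per-state exhaustion slopes that the paper estimates from measured data. A workload-independent estimate would use a single slope for all states instead.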