As increasingly complex computer systems take on a controlling role in all aspects of modern life, system availability and the associated downtime of technical systems have acquired critical importance. Losses due to system downtime have risen manifold and become wide-ranging. Even though the component-level availability of hardware and software has increased considerably, system-wide availability still needs improvement, because the heterogeneity of components and the complexity of their interconnections have also increased considerably. As systems become more interconnected and diverse, architects are less able to anticipate and design for every interaction among components, leaving such issues to be dealt with at runtime. In this paper, we therefore propose an approach for autonomic management of system availability that provides real-time evaluation, monitoring, and management of the availability of systems in critical applications. The approach is hybrid: analytic models provide the behavioral abstraction of components and subsystems, their interconnections, and their dependencies, while statistical inference applied to data from real-time monitoring of those components and subsystems parameterizes the system availability model. The model is solved online (that is, in real time), so that at any instant both point and interval estimates of the overall system availability are obtained by propagating the point and interval estimates of each input parameter through the system model. Online monitoring and estimation of system availability can then lead to adaptive online control of system availability.
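The idea of propagating point and interval estimates through a system model can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the system is modeled as two components in series, each component's steady-state availability is taken as MTTF / (MTTF + MTTR), the interval estimate comes from a non-parametric bootstrap over the monitored samples, and the monitoring data and function names are hypothetical.

```python
import random
import statistics


def availability(ttf, ttr):
    """Steady-state availability from observed times to failure and to repair."""
    mttf = statistics.mean(ttf)
    mttr = statistics.mean(ttr)
    return mttf / (mttf + mttr)


def system_availability(components):
    """Assumed series structure: all components must be up, so availabilities multiply."""
    result = 1.0
    for ttf, ttr in components:
        result *= availability(ttf, ttr)
    return result


def bootstrap_interval(components, n_resamples=2000, level=0.90, seed=0):
    """Percentile bootstrap: resample each component's data, re-solve the model."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_resamples):
        resampled = [
            (rng.choices(ttf, k=len(ttf)), rng.choices(ttr, k=len(ttr)))
            for ttf, ttr in components
        ]
        samples.append(system_availability(resampled))
    samples.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return samples[lo_idx], samples[hi_idx]


# Hypothetical monitoring data in hours: (times to failure, times to repair).
observed = [
    ([950.0, 1100.0, 870.0, 1230.0, 1010.0], [2.0, 1.5, 3.0, 2.5, 1.0]),
    ([400.0, 520.0, 610.0, 480.0, 450.0], [0.5, 0.8, 0.4, 0.6, 0.7]),
]

point = system_availability(observed)
low, high = bootstrap_interval(observed)
print(f"point estimate: {point:.5f}")
print(f"90% interval:   [{low:.5f}, {high:.5f}]")
```

In an online setting, a calculation of this kind would be repeated as new failure and repair observations arrive, so that the estimates track the monitored system. A parametric interval method or a richer analytic model (for example a Markov chain or a fault tree) could replace the bootstrap and the series product without changing the overall flow.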