Towards IT systems capable of managing their health

Authors:
Selvi Kadirvel; José A. B. Fortes
Affiliations:
Advanced Computing and Information Systems Lab, NSF Center for Autonomic Computing, University of Florida, Gainesville, Florida; Advanced Computing and Information Systems Lab, NSF Center for Autonomic Computing, University of Florida, Gainesville, Florida
Venue:
FOCS'10 Proceedings of the 16th Monterey conference on Foundations of computer software: modeling, development, and verification of adaptive systems
Year:
2010

Abstract

Self-caring systems are systems capable of monitoring and managing their own health and, indirectly, their useful lifetime. Unlike self-healing systems, which react to faults and failures, self-caring systems are aware of their health and can therefore potentially circumvent and adapt to impending faults, or recover from them more quickly and effectively. Towards a methodology for modeling and incorporating health-management logic and control mechanisms into an Information Technology (IT) system whose health needs to be managed, we propose the following: