Self-Caring IT Systems: A Proof-of-Concept Implementation in Virtualized Environments

  • Authors:
  • Selvi Kadirvel; Jose A. B. Fortes

  • Venue:
  • CLOUDCOM '10: Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science
  • Year:
  • 2010

Abstract

In self-caring IT systems, faults are handled proactively, for example by slowing the deterioration of system health and thereby avoiding or delaying system failures. This requires health management, which entails health monitoring, diagnosis, prognosis, and the planning of recovery and remediation actions. We give a brief overview of our prior work, which proposes a general methodology for capturing system properties and incorporating health management using Petri nets. We then describe in detail an application of this formal method to the design and development of middleware that manages the health of a batch-based job submission system on a virtualized platform. First, we describe how a real-world job submission IT system is converted to a Petri net model. Second, we use this model for system validation and analysis to understand the resource needs of different activities in the IT chain. Third, we describe how the executable model is used as a system manager to control the operation and health management of a virtualized test bed. Fourth, we illustrate the use of a feedback controller to manage health deterioration due to resource depletion in the job-execution stage of the modeled IT chain. Using a proof-of-concept implementation, we show that early detection and handling of health deterioration yields significant benefits in cost savings and downtime reduction. Experimental results show that our health management framework can effectively prevent job failures while imposing low overhead on the managed system. For a typical workload consisting of jobs that suffer from potential resource depletion faults, our feedback controller gains the useful life needed for critical planning and remediation actions in up to 82% of the jobs.
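
To make the idea of an executable Petri net model of a job submission chain concrete, the following is a minimal sketch and not the authors' model. The place names (`submitted`, `vm_free`, `executing`, `done`), the transition names, and the helpers `build_job_chain` and `run_to_completion` are all hypothetical, and the net uses unit arc weights for simplicity. It only illustrates how tokens for jobs and virtual machine slots could gate which activity may fire next.

```python
from dataclasses import dataclass, field


@dataclass
class PetriNet:
    """Place/transition net with unit arc weights (an illustrative simplification)."""
    marking: dict                                    # place name -> token count
    transitions: dict = field(default_factory=dict)  # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (tuple(inputs), tuple(outputs))

    def enabled(self, name):
        # A transition is enabled when every input place holds at least one token.
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= 1 for p in inputs)

    def fire(self, name):
        # Firing consumes one token per input place and produces one per output place.
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1


def build_job_chain(jobs, vm_slots):
    """Hypothetical two-stage chain: a job needs a free VM slot to start executing."""
    net = PetriNet(marking={"submitted": jobs, "vm_free": vm_slots,
                            "executing": 0, "done": 0})
    net.add_transition("start_job", ["submitted", "vm_free"], ["executing"])
    net.add_transition("finish_job", ["executing"], ["done", "vm_free"])
    return net


def run_to_completion(net):
    """Fire the first enabled transition repeatedly until none is enabled."""
    fired = []
    while True:
        ready = [t for t in net.transitions if net.enabled(t)]
        if not ready:
            return fired
        net.fire(ready[0])
        fired.append(ready[0])


if __name__ == "__main__":
    net = build_job_chain(jobs=3, vm_slots=1)
    print(run_to_completion(net))
    print(net.marking)
```

In this sketch the `vm_free` place acts as a resource constraint: with a single slot, jobs run strictly one at a time. A health manager in the spirit of the abstract would presumably add places and transitions for monitoring and remediation, which are omitted here.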