SLA-aware resource over-commit in an IaaS cloud

  • Authors:
  • David Breitgand; Zvi Dubitzky; Amir Epstein; Alex Glikson; Inbar Shapira

  • Affiliations:
  • IBM Haifa Research Lab, Haifa, Israel (all authors)

  • Venue:
  • Proceedings of the 8th International Conference on Network and Service Management
  • Year:
  • 2012

Abstract

The cloud paradigm facilitates cost-efficient elastic computing by allowing workloads to scale on demand. As the size of a cloud increases, the probability that all workloads simultaneously scale up to their maximum demand diminishes. This observation allows cloud resources to be multiplexed among multiple workloads, greatly improving resource utilization. The ability to host virtualized workloads such that the available physical capacity is smaller than the sum of the workloads' maximal demands is referred to as over-commit or over-subscription. Naturally, over-commit implies a risk of resource congestion. There is therefore a tradeoff between improving resource utilization by increasing the over-commit ratio and exposing the infrastructure provider and its customers to the risk of resource congestion. In this work, we observe that while resource multiplexing occurs naturally in the cloud, the risks associated with exploiting it to reach higher levels of cloud utilization are not transparent to customers. We consider workloads comprising elastic groups of Virtual Machines (VMs). We suggest that cloud providers extend the standard availability Service Level Agreement (SLA) to express the probability of successfully launching a VM (to expand a workload), complementing the current practice of offering an SLA only on the availability of VMs that have already been launched. Using the proposed extended availability SLA, we introduce the notion of cloud effective demand, which generalizes the previously introduced notions of the effective size of a single VM and the effective bandwidth of stand-alone and multiplexed network connections. We propose an algorithmic framework that uses cloud effective demand to estimate the total physical capacity required for SLA compliance under over-commit. We evaluate the proposed methodology using simulations based on data collected from a real private cloud production environment.
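
The tradeoff between over-commit ratio and congestion risk can be illustrated with a small sketch. It is not the paper's cloud effective demand algorithm, which the abstract does not specify. It assumes a simplified model in which each VM independently demands either its peak size or nothing, and it uses a Gaussian approximation of the total demand. The names `required_capacity`, `overcommit_ratio`, `peaks`, `p_active`, and `epsilon` are hypothetical and chosen only for illustration; `epsilon` stands in for the tolerated probability that demand exceeds capacity, for example a failed VM launch.

```python
from math import sqrt
from statistics import NormalDist


def required_capacity(peaks, p_active, epsilon):
    """Estimate capacity C such that P(total demand > C) is roughly epsilon.

    Assumptions (illustrative only, not taken from the paper):
    - VM i demands peaks[i] with probability p_active[i], else 0.
    - VM demands are independent.
    - Total demand is approximated by a normal distribution.
    """
    mean = sum(d * p for d, p in zip(peaks, p_active))
    var = sum(d * d * p * (1.0 - p) for d, p in zip(peaks, p_active))
    # Quantile of the standard normal at level 1 - epsilon.
    z = NormalDist().inv_cdf(1.0 - epsilon)
    # Never provision more than the sum of peaks (no over-commit at all).
    return min(sum(peaks), mean + z * sqrt(var))


def overcommit_ratio(peaks, p_active, epsilon):
    """Ratio of the sum of maximal demands to the estimated capacity."""
    return sum(peaks) / required_capacity(peaks, p_active, epsilon)


if __name__ == "__main__":
    for n in (100, 10_000):
        peaks = [1.0] * n        # hypothetical: every VM has peak size 1
        p_active = [0.3] * n     # hypothetical: each VM is at peak 30% of the time
        ratio = overcommit_ratio(peaks, p_active, epsilon=0.01)
        print(f"n={n}: over-commit ratio ~ {ratio:.2f}")
```

Under these assumptions, the larger pool supports a higher over-commit ratio at the same risk level, because the standard deviation of total demand grows more slowly than its mean. Real workloads are correlated and bursty, so a production estimate would need measured demand distributions rather than this model.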