Process-oriented recovery for operations on cloud applications

  • Authors:
  • Min Fu;Liming Zhu;Anna Liu;Xiwei Xu;Len Bass

  • Affiliations:
  • University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia

  • Venue:
  • Proceedings of the 4th annual Symposium on Cloud Computing
  • Year:
  • 2013

Abstract

A large number of cloud application failures occur during sporadic operations on cloud applications, such as upgrades, deployment reconfiguration, migration, and scaling out/in. Most of these failures are caused by operator and process errors [1]. From a cloud consumer's perspective, recovery from such failures depends on the limited control and visibility that cloud providers offer. In addition, a large-scale system often has multiple operation processes running simultaneously, which makes error diagnosis and recovery harder. Existing built-in or infrastructure-based recovery mechanisms often assume random component failures and handle recovery through checkpoint-based rollback, compensation actions [2], redundancy, and rejuvenation [3]. These mechanisms do not consider the characteristics of a specific operation process: a set of steps carried out by scripts and humans that interact with fragile cloud infrastructure APIs and uncertain resources [4]. Other approaches, such as FATE/DESTINI [5], examine the process implied by a system's internal protocols and rely on the built-in recovery protocol to detect and recover from bugs. The problem we target lies at a different level: the external, sporadic activities that operate on a hosted cloud application.