Process-oriented recovery for operations on cloud applications

Authors:
Min Fu;Liming Zhu;Anna Liu;Xiwei Xu;Len Bass
Affiliations:
University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Citing 3

FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation

Cloud API issues: an empirical study and impact
Proceedings of the 9th international ACM SIGSOFT conference on Quality of software architectures

Detecting cloud provisioning errors using an annotated process model
Proceedings of the 8th Workshop on Middleware for Next Generation Internet Computing

Abstract

Many cloud application failures occur during sporadic operations on cloud applications, such as upgrade, deployment reconfiguration, migration and scaling-out/in. Most of these failures are caused by operator and process errors [1]. From a cloud consumer's perspective, recovery from them is constrained by the limited control and visibility that cloud providers offer. In addition, a large-scale system often runs multiple operation processes simultaneously, which complicates error diagnosis and recovery. Existing built-in or infrastructure-based recovery mechanisms often assume random component failures and use checkpoint-based rollback, compensation actions [2], redundancy and rejuvenation to handle recovery [3]. These mechanisms do not consider the characteristics of a specific operation process, which consists of a set of steps carried out by scripts and humans interacting with fragile cloud infrastructure APIs and uncertain resources [4]. Other approaches, such as FATE/DESTINI [5], look at the process implied by a system's internal protocols and rely on the built-in recovery protocol to detect and recover from bugs. The problem we target lies at a different level: the external sporadic activities that operate on a hosted cloud application.