Supporting undoability in systems operations

Authors:
Ingo Weber;Hiroshi Wada;Alan Fekete;Anna Liu;Len Bass
Affiliations:
NICTA, Sydney and School of Computer Science and Engineering, University of New South Wales;NICTA, Sydney and School of Computer Science and Engineering, University of New South Wales;NICTA, Sydney and School of Information Technologies, University of Sydney;NICTA, Sydney and School of Computer Science and Engineering, University of New South Wales;NICTA, Sydney and School of Computer Science and Engineering, University of New South Wales
Venue:
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Year:
2013



Abstract

When managing cloud resources, many administrators operate without a safety net. For instance, inadvertently deleting a virtual disk results in the complete loss of the data it contains. The ability to undo a collection of changes, reverting to a previous acceptable state, is widely recognized as valuable support for dependability. In this paper, we consider the particular needs of system administrators managing API-controlled resources, such as cloud resources at the IaaS level. We propose an approach based on an abstract model of the effects of each available operation. Using this model, we check to what degree each operation is undoable. A positive outcome of this check is a formal guarantee that any sequence of calls to such operations can be undone. A negative outcome includes information on the properties preventing undoability, e.g., which operations are not undoable and why. At runtime, we can then warn a user who is about to invoke an irreversible operation; if undo is possible and desired, we apply an AI planning technique to automatically create a workflow that takes the system back to the desired earlier state. We demonstrate the feasibility and applicability of the approach with a prototype implementation and a number of experiments.
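
To make the idea of an abstract effect model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a STRIPS-style encoding in which each operation has preconditions, added facts, and deleted facts, and it substitutes a plain breadth-first search for the AI planner mentioned in the abstract. The operation names (`create_disk`, `attach`, `detach`, `delete_disk`) and the facts (`absent`, `exists`, `attached`, `detached`, `has_data`) are hypothetical and chosen only to mirror the virtual-disk example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """Hypothetical abstract model of one API operation."""
    name: str
    pre: frozenset      # facts that must hold before the call
    add: frozenset      # facts the call makes true
    delete: frozenset   # facts the call makes false

    def applicable(self, state):
        return self.pre <= state

    def apply(self, state):
        return (state - self.delete) | self.add


def plan_undo(current, target, operations, max_depth=10):
    """Breadth-first search for operation names leading from current to target.

    Returns a list of names, or None if no sequence within max_depth exists.
    This is a stand-in for a real planner, used here only for illustration.
    """
    frontier = deque([(current, [])])
    seen = {current}
    while frontier:
        state, path = frontier.popleft()
        if state == target:
            return path
        if len(path) >= max_depth:
            continue
        for op in operations:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [op.name]))
    return None


f = frozenset
# Assumed toy domain: a single virtual disk.
OPS = [
    Operation("create_disk", f({"absent"}), f({"exists", "detached"}), f({"absent"})),
    Operation("attach", f({"exists", "detached"}), f({"attached"}), f({"detached"})),
    Operation("detach", f({"attached"}), f({"detached"}), f({"attached"})),
    # Deleting the disk also destroys its data; no operation adds "has_data" back.
    Operation("delete_disk", f({"exists", "detached"}), f({"absent"}),
              f({"exists", "detached", "has_data"})),
]

checkpoint = f({"exists", "attached", "has_data"})
after_detach = f({"exists", "detached", "has_data"})
after_delete = f({"absent"})

undo_detach = plan_undo(after_detach, checkpoint, OPS)
undo_delete = plan_undo(after_delete, checkpoint, OPS)
print(undo_detach)  # a one-step plan that re-attaches the disk
print(undo_delete)  # None: the checkpoint is unreachable after deletion
```

In this toy domain, detaching the disk can be undone because the search finds a path back to the checkpoint, whereas deleting it cannot, since no modeled operation restores the `has_data` fact. A tool built along these lines could use that negative result to warn the administrator before the irreversible call is made.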