On fault resilience of OpenStack

Authors:
Xiaoen Ju;Livio Soares;Kang G. Shin;Kyung Dong Ryu;Dilma Da Silva
Affiliations:
University of Michigan;IBM T. J. Watson Research Center;University of Michigan;IBM T. J. Watson Research Center;Qualcomm Research Silicon Valley
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Citing 22
Cited 0

An empirical study of the reliability of UNIX utilities

Communications of the ACM
Performance debugging for distributed systems of black boxes

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Xen and the art of virtualization

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
WAP5: black-box performance debugging for wide-area systems

Proceedings of the 15th international conference on World Wide Web
Path-based faliure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Replay debugging for distributed applications

ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Pip: detecting the unexpected in distributed systems

NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
EIO: error handling is occasionally correct

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
D3S: debugging deployed distributed systems

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
MODIST: transparent model checking of unmodified distributed systems

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
vPath: precise discovery of request processing paths from black-box observations of thread and network activities

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
FATE and DESTINI: a framework for cloud recovery testing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
WiDS checker: combating bugs in distributed systems

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
X-trace: a pervasive network tracing framework

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Detecting failures in distributed systems with the Falcon spy network

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Efficient Testing of Recovery Code Using Fault Injection

ACM Transactions on Computer Systems (TOCS)
Scalable testing of file system checkers

Proceedings of the 7th ACM european conference on Computer Systems
Understanding the effects and implications of compute node related failures in hadoop

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Fay: Extensible Distributed Tracing from Kernels to Clusters

ACM Transactions on Computer Systems (TOCS)
Failure recovery: when the cure is worse than the disease

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Limplock: understanding the impact of limpware on scale-out cloud systems

Proceedings of the 4th annual Symposium on Cloud Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Cloud-management stacks have become an increasingly important element in cloud computing, serving as the resource manager of cloud platforms. While the functionality of this emerging layer has been constantly expanding, its fault resilience remains under-studied. This paper presents a systematic study of the fault resilience of OpenStack---a popular open source cloud-management stack. We have built a prototype fault-injection framework targeting service communications during the processing of external requests, both among OpenStack services and between OpenStack and external services, and have thus far uncovered 23 bugs in two versions of OpenStack. Our findings shed light on defects in the design and implementation of state-of-the-art cloud-management stacks from a fault-resilience perspective.