SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Using queries for distributed monitoring and forensics
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
MulVAL: a logic-based network security analyzer
SSYM'05 Proceedings of the 14th conference on USENIX Security Symposium - Volume 14
EXPLODE: a lightweight, general system for finding serious storage system errors
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
The Chubby lock service for loosely-coupled distributed systems
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Paxos made live: an engineering perspective
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
On designing and deploying internet-scale services
LISA'07 Proceedings of the 21st conference on Large Installation System Administration Conference
D3S: debugging deployed distributed systems
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Toward a cloud computing research agenda
ACM SIGACT News
MODIST: transparent model checking of unmodified distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Boom analytics: exploring data-centric, declarative programming for the cloud
Proceedings of the 5th European conference on Computer systems
Benchmarking cloud serving systems with YCSB
Proceedings of the 1st ACM symposium on Cloud computing
SQCK: a declarative file system checker
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
GATEKEEPER: mostly static enforcement of security and reliability policies for javascript code
SSYM'09 Proceedings of the 18th conference on USENIX security symposium
ZooKeeper: wait-free coordination for internet-scale systems
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
An extensible technique for high-precision testing of recovery code
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
The Hadoop Distributed File System
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Towards automatically checking thousands of failures with micro-specifications
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Life, death, and the critical transition: finding liveness bugs in systems code
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
WiDS checker: combating bugs in distributed systems
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Using declarative invariants for protecting file-system integrity
PLOS '11 Proceedings of the 6th Workshop on Programming Languages and Operating Systems
PREFAIL: a programmable tool for multiple-failure injection
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Scalable testing of file system checkers
Proceedings of the 7th ACM european conference on Computer Systems
Fast black-box testing of system recovery code
Proceedings of the 7th ACM european conference on Computer Systems
Don't trust your roommate or access control and replication protocols in "Home" environments
HotStorage'12 Proceedings of the 4th USENIX conference on Hot Topics in Storage and File Systems
Be conservative: enhancing failure diagnosis with proactive logging
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Failure scenario as a service (FSaaS) for Hadoop clusters
Proceedings of the Workshop on Secure and Dependable Middleware for Cloud Monitoring and Management
Adversarial testing of wireless routing implementations
Proceedings of the sixth ACM conference on Security and privacy in wireless and mobile networks
On fault resilience of OpenStack
Proceedings of the 4th annual Symposium on Cloud Computing
Process-oriented recovery for operations on cloud applications
Proceedings of the 4th annual Symposium on Cloud Computing
Detecting cloud provisioning errors using an annotated process model
Proceedings of the 8th Workshop on Middleware for Next Generation Internet Computing
Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review
HARDFS: hardening HDFS with selective and lightweight versioning
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Hi-index | 0.00 |
As the cloud era begins and failures become commonplace, failure recovery becomes a critical factor in the availability, reliability and performance of cloud services. Unfortunately, recovery problems still take place, causing downtimes, data loss, and many other problems. We propose a new testing framework for cloud recovery: FATE (Failure Testing Service) and DESTINI (Declarative Testing Specifications). With FATE, recovery is systematically tested in the face of multiple failures. With DESTINI, correct recovery is specified clearly, concisely, and precisely. We have integrated our framework to several cloud systems (e.g., HDFS [33]), explored over 40,000 failure scenarios, wrote 74 specifications, found 16 new bugs, and reproduced 51 old bugs.