A case for redundant arrays of inexpensive disks (RAID)
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
Experiments on six commercial TCP implementations using a software fault injection tool
Software—Practice & Experience
Korat: automated testing based on Java predicates
ISSTA '02 Proceedings of the 2002 ACM SIGSOFT international symposium on Software testing and analysis
Comparing the Robustness of POSIX Operating Systems
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Policy/mechanism separation in Hydra
SOSP '75 Proceedings of the fifth ACM symposium on Operating systems principles
Decoupling policy from mechanism in Internet routing
ACM SIGCOMM Computer Communication Review
Testing of java web services for robustness
ISSTA '04 Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis
Error Propagation Profiling of Operating Systems
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Proceedings of the twentieth ACM symposium on Operating systems principles
FAIL-FCI: Versatile fault injection
Future Generation Computer Systems
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Using model checking to find serious file system errors
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
The Chubby lock service for loosely-coupled distributed systems
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Paxos made live: an engineering perspective
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Automated testing of refactoring engines
Proceedings of the the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering
MODIST: transparent model checking of unmodified distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Hadoop: The Definitive Guide
Test generation through programming in UDITA
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Benchmarking cloud serving systems with YCSB
Proceedings of the 1st ACM symposium on Cloud computing
Characterizing cloud computing hardware reliability
Proceedings of the 1st ACM symposium on Cloud computing
ZooKeeper: wait-free coordination for internet-scale systems
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
An extensible technique for high-precision testing of recovery code
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
The Hadoop Distributed File System
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Towards automatically checking thousands of failures with micro-specifications
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Availability in globally distributed storage systems
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Failure scenario as a service (FSaaS) for Hadoop clusters
Proceedings of the Workshop on Secure and Dependable Middleware for Cloud Monitoring and Management
Hi-index | 0.00 |
As hardware failures are no longer rare in the era of cloud computing, cloud software systems must "prevail" against multiple, diverse failures that are likely to occur. Testing software against multiple failures poses the problem of combinatorial explosion of multiple failures. To address this problem, we present PreFail, a programmable failure-injection tool that enables testers to write a wide range of policies to prune down the large space of multiple failures. We integrate PreFail to three cloud software systems (HDFS, Cassandra, and ZooKeeper), show a wide variety of useful pruning policies that we can write for them, and evaluate the speed-ups in testing time that we obtain by using the policies. In our experiments, our testing approach with appropriate policies found all the bugs that one can find using exhaustive testing while spending 10X--200X less time than exhaustive testing.