ACM Transactions on Computer Systems (TOCS)
Bridging the Information Gap in Storage Protocol Stacks
ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
Improving availability with recursive microreboots: a soft-state system case study
Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Cheap recovery: a key to self-managing state
ACM Transactions on Storage (TOS)
Storage-Aware Caching: Revisiting Caching for Heterogeneous Storage Systems
FAST '02 Proceedings of the 1st USENIX Conference on File and Storage Technologies
Proceedings of the twentieth ACM symposium on Operating systems principles
EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Session state: beyond soft state
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Measurement and analysis of TCP throughput collapse in cluster-based storage systems
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
On application-level approaches to avoiding TCP throughput collapse in cluster-based storage systems
PDSW '07 Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07
High-available grid services through the use of virtualized clustering
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Tolerating hardware device failures in software
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Chain replication in theory and in practice
Proceedings of the 9th ACM SIGPLAN workshop on Erlang
Storage-aware caching: revisiting caching for heterogeneous storage systems
FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies
Root-cause analysis of performance anomalies in web-based applications
Proceedings of the 2011 ACM Symposium on Applied Computing
Disks are like snowflakes: no two are alike
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Faults in large distributed systems and what we can do about them
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Automated diagnosis without predictability is a recipe for failure
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Themis: an I/O-efficient MapReduce
Proceedings of the Third ACM Symposium on Cloud Computing
Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
Abstract: Traditional fault models present system designers with two extremes: the Byzantine fault model, which is general and therefore difficult to apply, and the fail-stop fault model, which is easier to employ but does not accurately capture modern device behavior. To address this gap, we introduce the concept of fail-stutter fault tolerance, a realistic and yet tractable fault model that accounts for both absolute failure and a new range of performance failures common in modern components. Systems built under the fail-stutter model will likely perform well, be highly reliable and available, and be easier to manage when deployed.