Fail-Stutter Fault Tolerance

Authors:
Remzi H. Arpaci-Dusseau;Andrea C. Arpaci-Dusseau
Venue:
HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Year:
2001

Citing 0
Cited 21

Run-time adaptation in river
ACM Transactions on Computer Systems (TOCS)

Bridging the Information Gap in Storage Protocol Stacks
ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference

Improving availability with recursive microreboots: a soft-state system case study
Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers

Cheap recovery: a key to self-managing state
ACM Transactions on Storage (TOS)

Storage-Aware Caching: Revisiting Caching for Heterogeneous Storage Systems
FAST '02 Proceedings of the 1st USENIX Conference on File and Storage Technologies

IRON file systems
Proceedings of the twentieth ACM symposium on Operating systems principles

Automating data dependability
EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop

Crash-only software
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9

Session state: beyond soft state
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1

Measurement and analysis of TCP throughput collapse in cluster-based storage systems
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies

On application-level approaches to avoiding TCP throughput collapse in cluster-based storage systems
PDSW '07 Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07

High-available grid services through the use of virtualized clustering
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing

Tolerating hardware device failures in software
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles

Chain replication in theory and in practice
Proceedings of the 9th ACM SIGPLAN workshop on Erlang

Storage-aware caching: revisiting caching for heterogeneous storage systems
FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies

Root-cause analysis of performance anomalies in web-based applications
Proceedings of the 2011 ACM Symposium on Applied Computing

Disks are like snowflakes: no two are alike
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems

Faults in large distributed systems and what we can do about them
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Automated diagnosis without predictability is a recipe for failure
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing

Themis: an I/O-efficient MapReduce
Proceedings of the Third ACM Symposium on Cloud Computing

Fault tolerance: case study
Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology

Abstract

Traditional fault models present system designers with two extremes: the Byzantine fault model, which is general and therefore difficult to apply, and the fail-stop fault model, which is easier to employ but does not accurately capture modern device behavior. To address this gap, we introduce the concept of fail-stutter fault tolerance, a realistic and yet tractable fault model that accounts for both absolute failure and a new range of performance failures common in modern components. Systems built under the fail-stutter model will likely perform well, be highly reliable and available, and be easier to manage when deployed.