Abstract: This paper argues that a common design paradigm for systems is fundamentally flawed, resulting in unstable, unpredictable behavior as the complexity of the system grows. In this flawed paradigm, designers carefully attempt to predict the operating environment and failure modes of the system in order to design its basic operational mechanisms. However, as a system grows in complexity, the diffuse coupling between its components inevitably leads to the butterfly effect, in which small perturbations can result in large changes in behavior. We explore this effect in the context of distributed data structures, a scalable, cluster-based storage server. We then consider a number of design techniques that help a system remain robust in the face of the unexpected, including overprovisioning, admission control, introspection, and adaptivity through closed control loops. Ultimately, however, all complex systems must eventually contend with the unpredictable. Because of this, we believe systems should be designed to cope with failure gracefully.