In this paper, we propose a management framework for protecting large computer systems against operator mistakes. By detecting mistakes and confining them to isolated portions of the managed system, our framework facilitates correct operation even by inexperienced operators. We built a prototype management system, called Barricade, based on our framework. We evaluate Barricade by deploying it for two different systems, a prototype Internet service and an enterprise computer infrastructure, and conducting experiments with 20 volunteer operators. The results are promising: in 49 live operator experiments performed with our Internet service, Barricade detected and contained 39 of the 43 mistakes that we observed.