Understanding and dealing with operator mistakes in internet services
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Operator mistakes are a significant source of unavailability in modern Internet services. In this paper, we first characterize these mistakes by performing an extensive set of experiments using human operators and a realistic three-tier auction service. The mistakes we observed range from software misconfiguration, to fault misdiagnosis, to incorrect software restarts. We next propose to validate operator actions before they are made visible to the rest of the system. We demonstrate how to accomplish this task via the creation of a validation environment that is an extension of the online system, where components can be validated using real workloads before they are migrated into the running service. We show that our prototype validation system can detect 66% of the operator mistakes that we have observed.
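As a rough illustration of the validation idea described above, and not the prototype evaluated in the paper, the following Python sketch feeds the same requests to the online component and to a component under validation, and allows migration only when their replies agree. Comparing against the online component is one possible check, assumed here for simplicity; all names (`validate_component`, `online_handler`, `candidate_handler`, `max_mismatch_rate`) are hypothetical.

```python
from typing import Callable, Iterable

# Hypothetical component interface: a function mapping a request to a reply.
Handler = Callable[[str], str]


def validate_component(
    requests: Iterable[str],
    online_handler: Handler,
    candidate_handler: Handler,
    max_mismatch_rate: float = 0.0,
) -> bool:
    """Return True if the candidate may be migrated into the online service.

    Assumption: the candidate is acceptable when its replies match those of
    the online component on the supplied workload.
    """
    total = 0
    mismatches = 0
    for request in requests:
        total += 1
        expected = online_handler(request)
        try:
            actual = candidate_handler(request)
        except Exception:
            # A crash in the candidate counts as a failed comparison.
            mismatches += 1
            continue
        if actual != expected:
            mismatches += 1
    if total == 0:
        # Nothing was exercised, so nothing was validated.
        return False
    return mismatches / total <= max_mismatch_rate


def online(request: str) -> str:
    """Stand-in for a correctly configured component."""
    return request.upper()


def misconfigured(request: str) -> str:
    """Stand-in for a component broken by an operator mistake."""
    return "error" if request == "bid" else request.upper()


if __name__ == "__main__":
    workload = ["browse", "bid", "search"]
    print(validate_component(workload, online, online))
    print(validate_component(workload, online, misconfigured))
```

In this sketch the candidate never serves live users: its replies are only compared and then discarded, which reflects the abstract's point that operator actions are checked before they become visible to the rest of the system.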