Understanding and dealing with operator mistakes in internet services

Authors:
Kiran Nagaraja;Fábio Oliveira;Ricardo Bianchini;Richard P. Martin;Thu D. Nguyen
Affiliations:
Department of Computer Science, Rutgers University, Piscataway, NJ;Department of Computer Science, Rutgers University, Piscataway, NJ;Department of Computer Science, Rutgers University, Piscataway, NJ;Department of Computer Science, Rutgers University, Piscataway, NJ;Department of Computer Science, Rutgers University, Piscataway, NJ
Venue:
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Year:
2004

Citing 17
Cited 39

Chameleon: A Software Infrastructure for Adaptive Fault Tolerance

IEEE Transactions on Parallel and Distributed Systems
Efficiency vs. portability in cluster-based network servers

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
SEDA: an architecture for well-conditioned, scalable internet services

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Using Abstraction to Improve Fault Tolerance

HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,

Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Lazy modular upgrades in persistent object stores

OOPSLA '03 Proceedings of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
A recovery-oriented approach to dependable services: repairing past errors with system-wide undo

A recovery-oriented approach to dependable services: repairing past errors with system-wide undo
Devirtualizable virtual machines enabling general, single-node, online maintenance

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Usable Autonomic Computing Systems: The Administrator's Perspective

ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Discovering Correctness Constraints for Self-Management of System Configuration

ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Undo for operators: building an undoable e-mail store

ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
Scheduling and simulation: how to upgrade distributed systems

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Magpie: online modelling and performance-aware systems

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Proactive recovery in a Byzantine-fault-tolerant system

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Understanding and dealing with operator mistakes in internet services

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Why do internet services fail, and what can be done about it?

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4

AOSD for internet service clusters: the case of availability

AOMD '05 Proceedings of the 1st workshop on Aspect oriented middleware development
Model-based validation for dealing with operator mistakes

Proceedings of the twentieth ACM symposium on Operating systems principles
Selective early request termination for busy internet services

Proceedings of the 15th international conference on World Wide Web
A: an assertion language for distributed systems

Proceedings of the 3rd workshop on Programming languages and operating systems: linguistic support for modern operating systems
Correlating multi-session attacks via replay

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
Operating systems should support business change

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Human-aware computer system design

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Understanding and dealing with operator mistakes in internet services

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Automatic configuration of internet services

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Discrete control for safe execution of IT automation workflows

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Staged deployment in mirage, an integrated software upgrade testing and distribution system

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Improving file system reliability with I/O shepherding

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Towards Scheduling Virtual Machines Based On Direct User Input

VTDC '06 Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing
Delta execution for software reliability

HotDep'07 Proceedings of the 3rd workshop on Hot Topics in System Dependability
SPIKE: best practice generation for storage area networks

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
DieCast: testing distributed systems with an accurate scale model

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Evaluating distributed systems: does background traffic matter?

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Towards automatic reverse engineering of software security configurations

Proceedings of the 15th ACM conference on Computer and communications security
Efficient online validation with delta execution

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Understanding customer problem troubleshooting from storage system logs

FAST '09 Proceedings of the 7th conference on File and storage technologies
Why do upgrades fail and what can we do about it?: toward dependable, online upgrades in enterprise system

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Barricade: defending systems against operator mistakes

Proceedings of the 5th European conference on Computer systems
Splitter: a proxy-based approach for post-migration testing of web applications

Proceedings of the 5th European conference on Computer systems
Service combinators for farming virtual machines

COORDINATION'08 Proceedings of the 10th international conference on Coordination models and languages
Why do upgrades fail and what can we do about it?: toward dependable, online upgrades in enterprise system

Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
Dependency-aware maintenance for highly available service-oriented grid

Journal of Systems and Software
Automatically generating predicates and solutions for configuration troubleshooting

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
JustRunIt: experiment-based management of virtualized data centers

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Automating configuration troubleshooting with dynamic information flow analysis

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
MassConf: automatic configuration tuning by leveraging user community information

Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
Correlating multi-session attacks via replay

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
Toward online testing of federated and heterogeneous distributed systems

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
An empirical study on configuration errors in commercial and open source systems

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Why do migrations fail and what can we do about it?

LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Do not blame users for misconfigurations

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
EnCore: exploiting system environment and correlation information for misconfiguration detection

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Back to the future: fault-tolerant live update with time-traveling state transfer

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration

Abstract

Operator mistakes are a significant source of unavailability in modern Internet services. In this paper, we first characterize these mistakes by performing an extensive set of experiments using human operators and a realistic three-tier auction service. The mistakes we observed range from software misconfiguration, to fault misdiagnosis, to incorrect software restarts. We next propose to validate operator actions before they are made visible to the rest of the system. We demonstrate how to accomplish this task via the creation of a validation environment that is an extension of the online system, where components can be validated using real workloads before they are migrated into the running service. We show that our prototype validation system can detect 66% of the operator mistakes that we have observed.
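
The validation idea in the abstract can be illustrated with a minimal sketch. It assumes one possible reading: a component the operator has just modified is fed the same requests as a known-good component, and it is migrated into the running service only if the responses agree. This is not the authors' prototype; every name here (`Handler`, `validate`, `deploy_if_valid`, `promote`) is hypothetical, and exact string comparison stands in for whatever comparison a real validator would use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical model: a component maps a request string to a response string.
Handler = Callable[[str], str]


@dataclass
class ValidationResult:
    total: int             # number of requests replayed
    mismatches: List[str]  # requests on which the comparison failed

    @property
    def passed(self) -> bool:
        # An empty workload proves nothing, so it does not count as a pass.
        return self.total > 0 and not self.mismatches


def validate(candidate: Handler, reference: Handler,
             workload: Iterable[str]) -> ValidationResult:
    """Replay each request against both components and record disagreements."""
    total = 0
    mismatches: List[str] = []
    for request in workload:
        total += 1
        try:
            ok = candidate(request) == reference(request)
        except Exception:
            # Any exception raised during the comparison counts as a failure.
            ok = False
        if not ok:
            mismatches.append(request)
    return ValidationResult(total, mismatches)


def deploy_if_valid(candidate: Handler, reference: Handler,
                    workload: Iterable[str],
                    promote: Callable[[Handler], None]) -> ValidationResult:
    """Call promote(candidate) only when validation passes."""
    result = validate(candidate, reference, workload)
    if result.passed:
        promote(candidate)
    return result
```

In this sketch the candidate never serves a client directly: `promote` is the only path into the live service, and it runs only after a clean replay, which mirrors the goal of keeping operator actions hidden from the rest of the system until they are validated.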