Motivated by the pressing need for increased dependability in corporate and Internet services and by the perspective that effective recovery can improve dependability as much as or more than avoiding failures, we introduce a novel recovery mechanism that gives human system operators the power of system-wide undo. System-wide undo allows operators to roll back erroneous changes to a service's state without losing end-user data or updates, to make retroactive repairs in the historical timeline of the service system, and thereby to quickly recover from catastrophic state corruption, operator error, failed upgrades, and external attacks, even when the root cause of the catastrophe is unknown. We explore system-wide undo via a framework based on the novel concept of spheres of undo, bubbles of state and time that provide scope to the state recoverable by undo and serve as a structuring tool for implementing undo on standalone services, hierarchically composed systems, and distributed interacting services. Crucially, spheres of undo allow us to define the concept of paradoxes, inconsistencies that occur when an undo process retroactively alters state that has been exposed outside of its containing sphere of undo. Managing paradoxes is the grand challenge of system-wide undo, and to tackle it we introduce a framework that automatically detects and compensates for paradoxes; our approach exploits the relaxed consistency semantics already present in existing services that interact with human end-users. We describe an implementation of our system-wide undo framework for standalone services with human end-users. We explore its applicability by assembling and evaluating a prototype undoable e-mail store service, by analyzing what would be necessary to construct an undoable online auction service, and by developing a set of guidelines to help service designers retrofit their services with undo.
We find that system-wide undo functionality imposes non-negligible but tolerable overhead in terms of both time and space. Using a novel methodology we develop to benchmark human-assisted recovery processes, we also find that undo-based recovery has a net positive effect on dependability, providing significant improvements in correctness while only slightly degrading availability.
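To make the idea of a paradox concrete, the following is a minimal sketch and not the framework described above. It assumes a toy e-mail store whose end-user operations are kept in a log, an operator error modeled as a faulty delivery filter, and a repair that replaces the filter before the log is replayed. The names `Op`, `replay`, and `undo_with_repair` are hypothetical. A paradox is flagged wherever a read that the user already observed would return something different in the repaired timeline.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One logged end-user operation (hypothetical log format)."""
    kind: str          # "deliver" or "read"
    msg_id: int
    body: str = ""


def replay(log, drop):
    """Replay the log from an empty store.

    `drop(body)` decides whether a delivered message is discarded.
    Returns the final store and what each read exposed to the user.
    """
    store, exposed = {}, {}
    for op in log:
        if op.kind == "deliver":
            if not drop(op.body):
                store[op.msg_id] = op.body
        elif op.kind == "read":
            # Record the externally visible result of the read.
            exposed[op.msg_id] = store.get(op.msg_id)
    return store, exposed


def undo_with_repair(log, faulty_drop, repaired_drop):
    """Roll back, apply the repair, and replay all end-user operations.

    Returns the repaired store and the message ids whose previously
    exposed read results changed (the paradoxes in this toy model).
    """
    _, seen_before = replay(log, faulty_drop)
    store, seen_after = replay(log, repaired_drop)
    paradoxes = [m for m in seen_before if seen_before[m] != seen_after[m]]
    return store, paradoxes


log = [
    Op("deliver", 1, "hello"),
    Op("deliver", 2, "invoice attached"),
    Op("read", 1),
    Op("read", 2),                      # user saw "no such message"
    Op("deliver", 3, "invoice reminder"),
]

# Assumed operator error: a filter that wrongly drops any mail mentioning "invoice".
store, paradoxes = undo_with_repair(
    log,
    faulty_drop=lambda body: "invoice" in body,
    repaired_drop=lambda body: False,
)

for msg_id in paradoxes:
    # A compensation step would go here, e.g. telling the user the message reappeared.
    print(f"paradox: message {msg_id} was exposed differently before the repair")
```

In this sketch no end-user update is lost, because every logged delivery is replayed against the repaired configuration. Only message 2 is flagged, since it is the only state that was exposed outside the store before the repair and then changed.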