Undo for operators: building an undoable e-mail store

Authors:
Aaron B. Brown;David A. Patterson
Affiliations:
University of California, Berkeley, EECS Computer Science Division, Berkeley, CA;University of California, Berkeley, EECS Computer Science Division, Berkeley, CA
Venue:
ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
Year:
2003

Citing 14
Cited 41

Fault tolerance under UNIX

ACM Transactions on Computer Systems (TOCS)
ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging

ACM Transactions on Database Systems (TODS)
Managing update conflicts in Bayou, a weakly connected replicated storage system

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Timewarp: techniques for autonomous collaboration

Proceedings of the ACM SIGCHI Conference on Human factors in computing systems
Flexible conflict detection and management in collaborative applications

Proceedings of the 10th annual ACM symposium on User interface software and technology
The IceCube approach to the reconciliation of divergent replicas

Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
BASE: using abstraction to improve fault tolerance

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
The evolution of Coda

ACM Transactions on Computer Systems (TOCS)
Revocation of Unread E-mail in an Untrusted Network

ACISP '97 Proceedings of the Second Australasian Conference on Information Security and Privacy
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,

Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
ReVirt: enabling intrusion analysis through virtual-machine logging and replay

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Rewind, repair, replay: three R's to dependability

EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Exploring failure transparency and the limits of generic recovery

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Why do internet services fail, and what can be done about it?

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4

Improving availability with recursive microreboots: a soft-state system case study

Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Finding and preventing run-time error handling mistakes

OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Recovery-Oriented Computing: Building Multitier Dependability

Computer
Oops! Coping with Human Error in IT Systems

Queue - System Failures
A New Undo Function for Web-Based Management Information Systems

IEEE Internet Computing
Detecting past and present intrusions through vulnerability-specific predicates

Proceedings of the twentieth ACM symposium on Operating systems principles
The taser intrusion recovery system

Proceedings of the twentieth ACM symposium on Operating systems principles
AOSD for internet service clusters: the case of availability

AOMD '05 Proceedings of the 1st workshop on Aspect oriented middleware development
HANet: a framework toward ultimately reliable network services

Journal of Systems and Software
Undo for anyone, anywhere, anytime

Proceedings of the 11th workshop on ACM SIGOPS European workshop
Using time travel to diagnose computer problems

Proceedings of the 11th workshop on ACM SIGOPS European workshop
Doppelganger: Better browser privacy without the bother

Proceedings of the 13th ACM conference on Computer and communications security
Automatic high-performance reconstruction and recovery

Computer Networks: The International Journal of Computer and Telecommunications Networking
Correlating multi-session attacks via replay

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
Development tools for distributed applications

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Understanding and dealing with operator mistakes in internet services

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Configuration debugging as search: finding the needle in the haystack

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Kernel support for zero-loss Internet service restart

Software—Practice & Experience
AutoBash: improving configuration management with operating system causality analysis

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Improving file system reliability with I/O shepherding

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Exceptional situations and program reliability

ACM Transactions on Programming Languages and Systems (TOPLAS)
Virtual machine time travel using continuous data protection and checkpointing

ACM SIGOPS Operating Systems Review
Using causality to diagnose configuration bugs

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Alcatraz: An Isolated Environment for Experimenting with Untrusted Software

ACM Transactions on Information and System Security (TISSEC)
Network-Wide Rollback Scheme for Fast Recovery from Operator Errors Toward Dependable Network

APNOMS '08 Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management
Modular data centers: how to design them?

Proceedings of the 1st ACM workshop on Large-Scale system and application performance
Usable autonomic computing systems: The system administrators' perspective

Advanced Engineering Informatics
Proposal on network-wide rollback scheme for fast recovery from operator errors

DSOM'07 Proceedings of the Distributed systems: operations and management 18th IFIP/IEEE international conference on Managing virtualization of networks and services
Toward quantifying system manageability

HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Intrusion recovery using selective re-execution

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Automating configuration troubleshooting with dynamic information flow analysis

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Correlating multi-session attacks via replay

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
An empirical study on configuration errors in commercial and open source systems

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Exception-Handling bugs in java and a language extension to avoid them

Advanced Topics in Exception Handling Techniques
Bringing usability concerns to the design of software architecture

EHCI-DSVIS'04 Proceedings of the 2004 international conference on Engineering Human Computer Interaction and Interactive Systems
Using logical data protection and recovery to improve data availability

ISAS'05 Proceedings of the Second international conference on Service Availability
A reversible abstract machine and its space overhead

FMOODS'12/FORTE'12 Proceedings of the 14th joint IFIP WG 6.1 international conference and Proceedings of the 32nd IFIP WG 6.1 international conference on Formal Techniques for Distributed Systems
Stitch: A language for architecture-based self-adaptation

Journal of Systems and Software
Efficient patch-based auditing for web application vulnerabilities

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
X-ray: automating root-cause diagnosis of performance anomalies in production software

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
EXTERIOR: using a dual-VM based external shell for guest-OS introspection, configuration, and recovery

Proceedings of the 9th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

Quantified Score

Hi-index	0.02

Abstract

System operators play a critical role in maintaining server dependability yet lack powerful tools to help them do so. To help address this unfulfilled need, we describe Operator Undo, a tool that provides a forgiving operations environment by allowing operators to recover from their own mistakes, from unanticipated software problems, and from intentional or accidental data corruption. Operator Undo starts by intercepting and logging user interactions with a network service before they enter the system, creating a record of user intent. During an undo cycle, all system hard state is physically rewound, allowing the operator to perform arbitrary repairs; after repairs are complete, lost user data is reintegrated into the repaired system by replaying the logged user interactions while tracking and compensating for any resulting externally-visible inconsistencies. We describe the design and implementation of an application-neutral framework for Operator Undo, and detail the process by which we instantiated the framework in the form of an undo-capable e-mail store supporting SMTP mail delivery and IMAP mail retrieval. Our proof-of-concept e-mail implementation imposes only a small performance overhead, and can store days or weeks of recovery log on a single disk.
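
The rewind, repair, and replay cycle described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: the class `UndoableStore`, its methods, the `drop_filter` setting, and the two verbs `deliver` and `list` are all hypothetical names chosen for illustration, and an in-memory dictionary stands in for the mail store's hard state. For brevity the rewind restores only the mailboxes, so the operator's repair step resets the faulty setting explicitly.

```python
import copy


class UndoableStore:
    """Toy mail store with an intent log (hypothetical, for illustration only)."""

    def __init__(self):
        self.mailboxes = {}                    # hard state: user -> list of messages
        self.drop_filter = lambda msg: False   # operator-configurable setting
        self.log = []                          # record of user intent plus observed results
        self._snapshot = {}
        self._log_start = 0

    def checkpoint(self):
        # Save the hard state and remember where the log stood at that point.
        self._snapshot = copy.deepcopy(self.mailboxes)
        self._log_start = len(self.log)

    def _apply(self, verb, user, payload):
        box = self.mailboxes.setdefault(user, [])
        if verb == "deliver":
            if not self.drop_filter(payload):
                box.append(payload)
            return None                        # the sender sees no result either way
        if verb == "list":
            return tuple(box)                  # externally visible result
        raise ValueError(f"unknown verb: {verb}")

    def request(self, verb, user, payload=None):
        # Log the interaction before it enters the system, then record its result.
        entry = {"verb": verb, "user": user, "payload": payload, "result": None}
        self.log.append(entry)
        entry["result"] = self._apply(verb, user, payload)
        return entry["result"]

    def undo(self, repair):
        self.mailboxes = copy.deepcopy(self._snapshot)   # rewind
        repair(self)                                     # repair (operator-supplied)
        inconsistencies = []
        for entry in self.log[self._log_start:]:         # replay
            new = self._apply(entry["verb"], entry["user"], entry["payload"])
            if new != entry["result"]:
                # The user previously saw a different answer; flag it for compensation.
                inconsistencies.append((entry, new))
        return inconsistencies


store = UndoableStore()
store.checkpoint()
store.drop_filter = lambda msg: True          # operator mistake: all mail is dropped
store.request("deliver", "alice", "hello")
seen = store.request("list", "alice")         # alice sees an empty mailbox


def repair(s):
    s.drop_filter = lambda msg: False         # operator fixes the faulty setting


diffs = store.undo(repair)
```

After the undo cycle in this toy run, the lost message is back in `store.mailboxes["alice"]`, and `diffs` holds the one replayed `list` request whose result no longer matches what the user originally saw. A real system would then need a policy for compensating for such differences, which this sketch leaves out.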