On the road to recovery: restoring data after disasters

Authors:
Kimberly Keeton;Dirk Beyer;Ernesto Brau;Arif Merchant;Cipriano Santos;Alex Zhang
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA
Venue:
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Year:
2006

Abstract

Restoring data operations after a disaster is a daunting task: how should recovery be performed to minimize data loss and application downtime? Administrators are under considerable pressure to recover quickly, so they lack time to make good scheduling decisions. They schedule recovery based on rules of thumb, or on pre-determined orders that might not be best for the failure occurrence. With multiple workloads and recovery techniques, the number of possibilities is large, so the decision process is not trivial.

This paper makes several contributions to the area of data recovery scheduling. First, we formalize the description of potential recovery processes by defining recovery graphs. Recovery graphs explicitly capture alternative approaches for recovering workloads, including their recovery tasks, operational states, timing information and precedence relationships. Second, we formulate the data recovery scheduling problem as an optimization problem, where the goal is to find the schedule that minimizes the financial penalties due to downtime, data loss and vulnerability to subsequent failures. Third, we present several methods for finding optimal or near-optimal solutions, including priority-based, randomized and genetic algorithm-guided ad hoc heuristics. We quantitatively evaluate these methods using realistic storage system designs and workloads, and compare the quality of the algorithms' solutions to optimal solutions provided by a math programming formulation and to the solutions from a simple heuristic that emulates the choices made by human administrators. We find that our heuristics' solutions improve on the administrator heuristic's solutions, often approaching or achieving optimality.
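
To make the scheduling idea concrete, here is a toy sketch in Python. It is not the paper's formulation: it assumes a single shared restore device, one restore task per workload, and a penalty that grows linearly with downtime only (data loss and vulnerability penalties are left out). The names `Workload`, `outage_penalty`, `priority_schedule` and `exhaustive_schedule`, and all the numbers, are hypothetical. The sketch compares a fixed, pre-determined order with a simple priority rule (highest penalty rate per restore hour first) and with an exhaustive search that stands in for an optimal solver.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload: not taken from the paper."""
    name: str
    restore_hours: float     # time the shared restore device is busy
    penalty_per_hour: float  # assumed downtime cost rate until restored


def outage_penalty(order):
    """Total downtime penalty when workloads are restored back to back."""
    clock = 0.0
    total = 0.0
    for w in order:
        clock += w.restore_hours              # w comes back online at `clock`
        total += w.penalty_per_hour * clock   # it was down for `clock` hours
    return total


def priority_schedule(workloads):
    """Priority rule: highest penalty rate per restore hour goes first."""
    return sorted(
        workloads,
        key=lambda w: w.penalty_per_hour / w.restore_hours,
        reverse=True,
    )


def exhaustive_schedule(workloads):
    """Brute-force optimum; only feasible for a handful of workloads."""
    return min(permutations(workloads), key=outage_penalty)


# Assumed example data, listed in a fixed "rule of thumb" order.
workloads = [
    Workload("web", restore_hours=2.0, penalty_per_hour=10.0),
    Workload("db", restore_hours=4.0, penalty_per_hour=100.0),
    Workload("archive", restore_hours=1.0, penalty_per_hour=1.0),
]

fixed_cost = outage_penalty(workloads)
priority_cost = outage_penalty(priority_schedule(workloads))
optimal_cost = outage_penalty(exhaustive_schedule(workloads))
```

With these assumed numbers, the fixed order costs 627 while the priority rule and the exhaustive search both cost 467. The ratio rule is known to be optimal for this single-device, linear-penalty simplification. The setting described in the abstract, with alternative recovery paths, precedence relationships and several penalty types, is harder, which is why the abstract turns to randomized and genetic algorithm-guided heuristics.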