Fault-Tolerant Replication Management in Large-Scale Distributed Storage Systems

Authors:
Richard Golding; Elizabeth Borowsky
Venue:
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Year:
1999

Citing 13
Cited 3

Maintaining availability in partitioned replicated databases

ACM Transactions on Database Systems (TODS)
Leases: an efficient fault-tolerant mechanism for distributed file cache consistency

SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
The weakest failure detector for solving consensus

PODC '92 Proceedings of the eleventh annual ACM symposium on Principles of distributed computing
Unreliable failure detectors for reliable distributed systems

Journal of the ACM (JACM)
Group communication

Communications of the ACM
Horus: a flexible group communication system

Communications of the ACM
Petal: distributed virtual disks

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Probabilistic quorum systems

PODC '97 Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing
Capacity planning with phased workloads

Proceedings of the 1st international workshop on Software and performance
Coyote: a system for constructing fine-grain configurable communication services

ACM Transactions on Computer Systems (TOCS)
Voting with Regenerable Volatile Witnesses

Proceedings of the Seventh International Conference on Data Engineering
Weighted voting for replicated data

SOSP '79 Proceedings of the seventh ACM symposium on Operating systems principles
The Decentralized Non-Blocking Atomic Commitment Protocol

SPDP '95 Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing

Peer to Peer: Peering into the Future

Advanced Lectures on Networking, NETWORKING 2002 [This book presents the revised version of seven tutorials given at the NETWORKING 2002 Conference in Pisa, Italy in May 2002]
Walking toward moving goalposts: agile management for evolving systems

HotACI'06 Proceedings of the First international conference on Hot topics in autonomic computing

Abstract

Failures of all kinds happen, from the loss of a single network packet to site-wide disasters. Since businesses rely heavily on their data, it is imperative that failures require minimal time and effort to repair and that the service interruption during the failure or repair period be as short as possible. To this end, the ideal system should repair itself, relying on humans only when absolutely necessary in the repair process. This paper describes one component of a self-healing storage system: the component that allows for automatic recovery of access to data when the power comes back on after a large-scale outage. Our failure recovery protocol is part of a suite of modular protocols that make up the Palladio distributed storage system. This protocol guarantees that service will be restored quickly and automatically once enough failures are repaired.