Many stateful services use the replicated state machine approach for high availability. In this approach, a service runs on multiple machines to survive machine failures. This paper describes SMART, a new technique for changing the set of machines where such a service runs, i.e., migrating the service. SMART improves upon existing techniques in three important ways. First, SMART allows migrations that replace non-failed machines. Thus, SMART enables load balancing and lets an automated system replace failed machines. Such autonomic migration is an important step toward full autonomic operation, in which administrators play a minor role and need not be available twenty-four hours a day, seven days a week. Second, SMART can pipeline concurrent requests, a useful performance optimization. Third, prior published migration techniques are described in insufficient detail to admit implementation, whereas our description of SMART is complete. In addition to describing SMART, we also demonstrate its practicality by implementing it, evaluating our implementation's performance, and using it to build a consistent, replicated, migratable file system. Our experiments demonstrate the performance advantage of pipelining concurrent requests, and show that migration has only a minor and temporary effect on performance.