The SMART way to migrate replicated stateful services

Authors:
Jacob R. Lorch;Atul Adya;William J. Bolosky;Ronnie Chaiken;John R. Douceur;Jon Howell
Affiliations:
Microsoft Research;Microsoft Research;Microsoft Research;Microsoft Research;Microsoft Research;Microsoft Research
Venue:
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Year:
2006

Abstract

Many stateful services use the replicated state machine approach for high availability. In this approach, a service runs on multiple machines to survive machine failures. This paper describes SMART, a new technique for changing the set of machines where such a service runs, i.e., migrating the service. SMART improves upon existing techniques in three important ways. First, SMART allows migrations that replace non-failed machines. Thus, SMART enables load balancing and lets an automated system replace failed machines. Such autonomic migration is an important step toward full autonomic operation, in which administrators play a minor role and need not be available twenty-four hours a day, seven days a week. Second, SMART can pipeline concurrent requests, a useful performance optimization. Third, prior published migration techniques are described in insufficient detail to admit implementation, whereas our description of SMART is complete. In addition to describing SMART, we also demonstrate its practicality by implementing it, evaluating our implementation's performance, and using it to build a consistent, replicated, migratable file system. Our experiments demonstrate the performance advantage of pipelining concurrent requests, and show that migration has only a minor and temporary effect on performance.