This article describes an architecture that allows a replicated service to survive crashes without breaking its TCP connections. Our approach does not require modifications to the TCP protocol, to the operating system on the server, or to any of the software running on the clients. Furthermore, it runs on commodity hardware. We compare two implementations of this architecture, one based on primary/backup replication and the other on message logging, focusing on scalability, failover time, and application transparency. We evaluate three types of services: a file server, a Web server, and a multimedia streaming server. Our experiments suggest that the approach incurs low overhead on throughput, scales well as the number of clients increases, and allows recovery of the service in near-optimal time.