Practical and low-overhead masking of failures of TCP-based servers

Authors:
Dmitrii Zagorodnov;Keith Marzullo;Lorenzo Alvisi;Thomas C. Bressoud
Affiliations:
University of California, Santa Barbara, Santa Barbara, CA;University of California, San Diego, La Jolla, CA;The University of Texas at Austin, Austin, TX;Denison University, Granville, OH
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2009

Abstract

This article describes an architecture that allows a replicated service to survive crashes without breaking its TCP connections. Our approach does not require modifications to the TCP protocol, to the operating system on the server, or to any of the software running on the clients. Furthermore, it runs on commodity hardware. We compare two implementations of this architecture (one based on primary/backup replication and another based on message logging) focusing on scalability, failover time, and application transparency. We evaluate three types of services: a file server, a Web server, and a multimedia streaming server. Our experiments suggest that the approach incurs low overhead on throughput, scales well as the number of clients increases, and allows recovery of the service in near-optimal time.