Failover and takeover contingency mechanisms for network partition and node failure

Authors:
Macías López; Laura M. Castro; David Cabrero
Affiliations:
University of A Coruña, A Coruña, Spain; University of A Coruña, A Coruña, Spain; University of A Coruña, A Coruña, Spain
Venue:
Proceedings of the eleventh ACM SIGPLAN workshop on Erlang workshop
Year:
2012

Abstract

Properly defining suitable mechanisms to cope with network partition and to recover from node failure is among the most common problems when designing and implementing a fault-tolerant distributed system. The concern is even more serious when the different scenarios could not be predicted beforehand and are only detected once the system has reached the deployment stage. A number of decisions must be made when choosing the right contingency mechanisms to deal with these distribution-bound problems. The factors to take into account include not only the technology in use, the node layout, the message protocol, the properties of the messages to be exchanged, and desired or required features such as latency and bandwidth, but also the reliability of the communications network and even the hardware on which the system runs. In this paper we present ADVERTISE, a distributed system for transmitting advertisements to set-top boxes (STBs) in customers' homes over the digital TV network (iDTV) of a cable operator. We use this system as a case study to explain how we addressed the aforementioned problems, and we present a set of good practices that can be extrapolated to comparable systems.