Failover and takeover contingency mechanisms for network partition and node failure

  • Authors:
  • Macías López;Laura M. Castro;David Cabrero

  • Affiliations:
  • University of A Coruña, A Coruña, Spain;University of A Coruña, A Coruña, Spain;University of A Coruña, A Coruña, Spain

  • Venue:
  • Proceedings of the eleventh ACM SIGPLAN workshop on Erlang workshop
  • Year:
  • 2012

Abstract

Defining suitable mechanisms to cope with network partition and to recover from node failure is among the most common problems when designing and implementing a fault-tolerant distributed system. The concern is even more serious when the different scenarios cannot be predicted beforehand and are only detected once the system is at the deployment stage. A number of decisions must be made when choosing the right contingency mechanisms to deal with these distribution-bound problems. The factors that must be taken into account include not only the technology in use, the node layout, the message protocol, the properties of the messages to be exchanged, and desired or required features such as latency and bandwidth, but also the reliability of the communications network and even the hardware on which the system runs. In this paper we present ADVERTISE, a distributed system for transmitting advertisements to set-top boxes (STBs) in customers' homes over the digital TV network (iDTV) of a cable operator. We use this system as a case study to explain how we addressed the aforementioned problems, and we present a set of good practices that can be extrapolated to comparable systems.