Fault-tolerance in a distributed management system: a case study

  • Authors:
  • Robert Smeikal;Karl M. Goeschka

  • Affiliations:
  • Vienna University of Technology, Vienna, Austria;Frequentis Nachrichtentechnik GmbH, Vienna, Austria

  • Venue:
  • Proceedings of the 25th International Conference on Software Engineering
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

Our case study provides the most important conceptual lessons learned from the implementation of a Distributed Telecommunication Management System (DTMS), which controls a networked voice communication system. Major requirements for the DTMS are fault-tolerance against site or network failures, transactional safety, and reliable persistence. In order to provide distribution and persistence both transparently and fault-tolerant we introduce a two-layer architecture facilitating an asynchronous replication algorithm. Among the lessons learned are: component based software engineering poses a significant initial overhead but is worth it in the long term; a fault-tolerant naming service is a key requirement for fail-safe distribution; the reasonable granularity for persistence and concurrency control is one whole object; asynchronous replication on the database layer is superior to synchronous replication on the instance level in terms of robustness and consistency; semi-structured persistence with XML has drawbacks regarding consistency, performance and convenience; in contrast to an arbitrarily meshed object model, a accentuated hierarchical structure is more robust and feasible; a query engine has to provide a means for navigation through the object model; finally the propagation of deletion operation becomes more complex in an object-oriented model. By incorporating these lessons learned we are well underway to provide a highly available, distributed platform for persistent object systems.