Fault-tolerance in a distributed management system: a case study

Authors:
Robert Smeikal; Karl M. Goeschka
Affiliations:
Vienna University of Technology, Vienna, Austria; Frequentis Nachrichtentechnik GmbH, Vienna, Austria
Venue:
Proceedings of the 25th International Conference on Software Engineering
Year:
2003

Abstract

Our case study presents the most important conceptual lessons learned from implementing a Distributed Telecommunication Management System (DTMS), which controls a networked voice communication system. The major requirements for the DTMS are fault tolerance against site or network failures, transactional safety, and reliable persistence. To provide distribution and persistence in a way that is both transparent and fault-tolerant, we introduce a two-layer architecture that facilitates an asynchronous replication algorithm. Among the lessons learned are: component-based software engineering imposes a significant initial overhead but is worth it in the long term; a fault-tolerant naming service is a key requirement for fail-safe distribution; the reasonable granularity for persistence and concurrency control is one whole object; asynchronous replication on the database layer is superior to synchronous replication on the instance level in terms of robustness and consistency; semi-structured persistence with XML has drawbacks regarding consistency, performance, and convenience; in contrast to an arbitrarily meshed object model, an accentuated hierarchical structure is more robust and feasible; a query engine has to provide a means for navigation through the object model; and, finally, the propagation of deletion operations becomes more complex in an object-oriented model. By incorporating these lessons, we are well on the way to providing a highly available, distributed platform for persistent object systems.