Synchronization as a framework for distributed system fault-tolerance design

Authors:
Alexander B. Romanovsky
Affiliations:
St. Petersburg Technical University
Venue:
EW 5 Proceedings of the 5th workshop on ACM SIGOPS European workshop: Models and paradigms for distributed systems structuring
Year:
1992

References:
Concurrency control and recovery in database systems
Understanding fault-tolerant distributed systems. Communications of the ACM
Reaching Agreement in the Presence of Faults. Journal of the ACM (JACM)
Replicated distributed programs. Proceedings of the tenth ACM symposium on Operating systems principles
Time, clocks, and the ordering of events in a distributed system. Communications of the ACM
Exception handling: issues and a proposed notation. Communications of the ACM

Abstract

We regard a computer system as a whole comprising software, hardware and mixed components, each of which can, in turn, be regarded as a system. The entire system is then a multilevel hierarchy.

The purpose of this paper is to single out and generalize the essential features and properties of synchronization, and to argue in favour of designing and developing fault tolerance (FT) for distributed systems on the basis of a multilevel synchronization system. Recognizing and constructing such a synchronization system can be, and often is, a basis for providing system FT, with masking, recovery, voting, reconfiguration, migration, warm-starting, etc. subsequently built on top of it. The synchronization system controls and coordinates the operation of all main and redundant components, and it is by controlling redundancy that FT is achieved. Thus, error recovery can be done precisely at the system level where there is some redundancy; to do this, the operation of the main component and of the redundant ones has to be synchronized. Moreover, for that purpose all those components have to be in a certain known state (e.g. active, backup, fault, identical, alive), and providing this is precisely what synchronization is meant to do.

We propose to consider the design and development of a hierarchical system of synchronizing (HSS) redundant components in distributed systems as the first and foremost stage in providing FT in application systems. In this way the entire FT system can be structured naturally. HSS can be used as a basis for implementing any of the known FT schemes in which redundant components are controlled by different CPUs or can be active on their own (disk drivers, plants, processes, etc.).

Our idea and approach are a special case of the approach developed in [1]. We hope to add to the practical applicability of the latter (for FT design, particularly in distributed operating systems, or DOSs) and at the same time to single out and analyse the most essential underlying feature of providing FT, namely component synchronization.
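
The idea that FT is achieved by keeping every main and redundant component in a known state can be illustrated with a small sketch. The code below is not taken from the paper: the class `Synchronizer`, the method `report_fault`, the component names and the promotion rule (first available backup becomes active) are all assumptions made for illustration, and only a single level of the hierarchy is modelled.

```python
from enum import Enum


class State(Enum):
    # State names follow the examples given in the abstract.
    ACTIVE = "active"
    BACKUP = "backup"
    FAULT = "fault"


class Synchronizer:
    """Hypothetical one-level synchronizer for a group of redundant components."""

    def __init__(self, names):
        if not names:
            raise ValueError("at least one component is required")
        # Assumption: the first component starts as the main one,
        # all others start as backups.
        self.states = {name: State.BACKUP for name in names}
        self.states[names[0]] = State.ACTIVE

    def active(self):
        """Return the name of the active component, or None if none is left."""
        for name, state in self.states.items():
            if state is State.ACTIVE:
                return name
        return None

    def report_fault(self, name):
        """Mark a component as faulty; if it was active, promote a backup."""
        was_active = self.states[name] is State.ACTIVE
        self.states[name] = State.FAULT
        if was_active:
            for other, state in self.states.items():
                if state is State.BACKUP:
                    self.states[other] = State.ACTIVE
                    break
        return self.active()


sync = Synchronizer(["cpu0", "cpu1", "cpu2"])
new_active = sync.report_fault("cpu0")
print(new_active)  # cpu1
```

In this sketch, recovery is possible only while some component is still in the backup state, which mirrors the abstract's point that error recovery happens at the level where redundancy exists. A multilevel version would presumably let each component itself be managed by a lower-level synchronizer, but that is beyond what the abstract specifies.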