This paper develops an abstract model that describes different rollback recovery control algorithms for distributed systems in a uniform way. We first give a general definition of the distributed rollback recovery control problem. The concept of a distributed recovery control system (DRC system), consisting of distributed recovery control units (DRC units), is proposed to model recovery at various control granularities. We then develop a graph model, called the dependency graph, for distributed rollback recovery control algorithms. An atomic subgraph is defined as a subgraph induced by a set of nodes that has no outgoing arcs to nodes outside the set. Committing and aborting atomic actions can be modeled as identifying atomic subgraphs. Next, we define two kinds of dependency graphs, checkpoint graphs and unit graphs, based on the dependency relation defined by rollback propagation. We show that various types of distributed recovery control algorithms can be classified by how they identify atomic subgraphs in these two graphs. The model may therefore allow us to describe existing algorithms in a uniform way and, more importantly, to find new algorithms.
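The definition of an atomic subgraph can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the adjacency-list representation, the function names `is_atomic` and `atomic_closure`, and the sample node labels are all assumptions made for illustration. The sketch only encodes the stated property that no arc leaves the chosen node set, and it computes the smallest such set containing a given node by following outgoing arcs.

```python
from collections import deque


def is_atomic(graph, nodes):
    """Return True if no arc leaves `nodes` for a node outside the set.

    `graph` maps each node to a list of its successors (assumed
    representation; the source only describes the graph abstractly).
    """
    node_set = set(nodes)
    return all(
        succ in node_set
        for node in node_set
        for succ in graph.get(node, ())
    )


def atomic_closure(graph, start):
    """Smallest node set containing `start` that induces an atomic subgraph.

    Any set with no outgoing arcs that contains `start` must contain
    everything reachable from `start`, so a breadth-first search suffices.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen


# Hypothetical dependency graph; the labels and arcs are invented
# for this example and do not come from the paper.
graph = {
    "c1": ["c2"],
    "c2": ["c3"],
    "c3": ["c2"],
    "c4": [],
}

print(is_atomic(graph, {"c2", "c3"}))       # no arc leaves {c2, c3}
print(is_atomic(graph, {"c1", "c2"}))       # arc c2 -> c3 leaves the set
print(sorted(atomic_closure(graph, "c1")))  # c1 pulls in c2 and c3
```

Under these assumptions, a commit or abort decision would correspond to picking a node set for which `is_atomic` holds. How arcs are derived from rollback propagation in checkpoint graphs and unit graphs is left to the paper.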