On termination detection in crash-prone distributed systems with failure detectors

Authors:
Neeraj Mittal; Felix C. Freiling; S. Venkatesan; Lucia Draque Penso
Affiliations:
Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083, USA; Department of Computer Science, University of Mannheim, D-68131 Mannheim, Germany; Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083, USA; Department of Computer Science, University of Mannheim, D-68131 Mannheim, Germany
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

We investigate the problem of detecting termination of a distributed computation in systems where processes can fail by crashing. Specifically, when the communication topology is fully connected, we describe a way to transform any termination detection algorithm A that has been designed for a failure-free environment into a termination detection algorithm B that can tolerate process crashes. Our transformation assumes the existence of a perfect failure detector. We show that a perfect failure detector is in fact necessary to solve the termination detection problem in a crash-prone distributed system even if at most one process can crash. Let μ(n,M) and δ(n,M) denote the message complexity and detection latency, respectively, of A when the system has n processes and the underlying computation exchanges M application messages. The message complexity of B is O(n+μ(n,0)) messages per failure more than the message complexity of A. Also, its detection latency is O(δ(n,0)) per failure more than that of A. Furthermore, application message size increases by at most log(f+1) bits, where f is the actual number of processes that fail during an execution. We show that, when the communication topology is fully connected, under certain realistic assumptions, any fault-tolerant termination detection algorithm can be forced to exchange Ω(nf) control messages in the worst case, even when at most one process may be active initially and the underlying computation does not exchange any application messages. This implies that our transformation is optimal in terms of message complexity when μ(n,0)=O(n). The fault-tolerant termination detection algorithm resulting from the transformation satisfies three desirable properties. First, it can tolerate the failure of up to n-1 processes. Second, it does not impose any overhead on the fault-sensitive termination detection algorithm until one or more processes crash. Third, it does not block the application at any time.
Further, using our transformation, we derive a fault-tolerant termination detection algorithm that is, to our knowledge, the most efficient one proposed so far. Our transformation can be extended to arbitrary communication topologies provided process crashes do not partition the system.
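
The bounds stated in the abstract can be collected in one place. The sketch below writes μ and δ for the message complexity and detection latency of A, and assumes that the per-failure overheads simply add up over the f failures of an execution. The labels \mu_B and \delta_B for the complexities of B are introduced here for illustration only and are not notation taken from the paper.

```latex
\begin{align*}
\mu_B(n, M, f)    &= \mu(n, M) + O\bigl(f \cdot (n + \mu(n, 0))\bigr), \\
\delta_B(n, M, f) &= \delta(n, M) + O\bigl(f \cdot \delta(n, 0)\bigr), \\
\text{extra bits per application message} &\le \log(f + 1).
\end{align*}
% Lower bound on control messages for any fault-tolerant algorithm: \Omega(nf).
% If \mu(n, 0) = O(n), the added message cost is O(f \cdot (n + n)) = O(nf),
% which matches the lower bound.
```

Read this way, the optimality claim follows directly: whenever A needs only O(n) messages for a computation with no application messages, the extra cost of B is O(nf), which meets the Ω(nf) lower bound.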