Efficient reduction for wait-free termination detection in a crash-prone distributed system

Authors:
Neeraj Mittal;Felix C. Freiling;S. Venkatesan;Lucia Draque Penso
Affiliations:
Department of Computer Science, The University of Texas at Dallas, Richardson, TX;Department of Computer Science, RWTH Aachen University, Aachen, Germany;Department of Computer Science, The University of Texas at Dallas, Richardson, TX;Department of Computer Science, RWTH Aachen University, Aachen, Germany
Venue:
DISC'05 Proceedings of the 19th international conference on Distributed Computing
Year:
2005

Abstract

We investigate the problem of detecting termination of a distributed computation in systems where processes can fail by crashing. Specifically, when the communication topology is fully connected, we describe a way to transform any termination detection algorithm $\mathcal{A}$ that has been designed for a failure-free environment into a termination detection algorithm $\mathcal{B}$ that can tolerate process crashes. Our transformation assumes the existence of a perfect failure detector. We show that a perfect failure detector is in fact necessary to solve the termination detection problem in a crash-prone distributed system even if at most one process can crash. Let $\mu(n,M)$ and $\delta(n,M)$ denote the message complexity and detection latency, respectively, of $\mathcal{A}$ when the system has $n$ processes and the underlying computation exchanges $M$ application messages. The message complexity of $\mathcal{B}$ is at most $O(n + \mu(n,0))$ messages per failure more than the message complexity of $\mathcal{A}$. Also, its detection latency is at most $O(\delta(n,0))$ per failure more than that of $\mathcal{A}$. Furthermore, the overhead (that is, the amount of control data piggybacked) on an application message increases by only $O(\log n)$ bits per failure. The fault-tolerant termination detection algorithm resulting from the transformation satisfies two desirable properties. First, it can tolerate failure of up to $n-1$ processes, that is, it is wait-free. Second, it does not impose any overhead on the fault-sensitive termination detection algorithm until one or more processes crash, that is, it is fault-reactive. Our transformation can be extended to arbitrary communication topologies provided process crashes do not partition the system.
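
As a reading aid, the per-failure bounds stated above can be written out for a run in which $f$ processes crash. The symbol $f$ (with $0 \le f \le n-1$) and the names $\mu_{\mathcal{B}}$, $\delta_{\mathcal{B}}$, $\beta_{\mathcal{A}}$ and $\beta_{\mathcal{B}}$ are introduced here only for illustration and do not appear in the abstract. The sketch assumes that the per-failure costs simply add up across failures; $\beta$ stands for the number of control bits piggybacked on one application message.

$$\mu_{\mathcal{B}}(n, M, f) \le \mu(n, M) + O\big(f \cdot (n + \mu(n, 0))\big)$$

$$\delta_{\mathcal{B}}(n, M, f) \le \delta(n, M) + O\big(f \cdot \delta(n, 0)\big)$$

$$\beta_{\mathcal{B}} \le \beta_{\mathcal{A}} + O(f \log n)$$

Under this reading, setting $f = 0$ makes every additional term vanish, which is consistent with the fault-reactive property: $\mathcal{B}$ costs no more than $\mathcal{A}$ until a crash actually occurs.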