Efficient reduction for wait-free termination detection in a crash-prone distributed system

  • Authors:
  • Neeraj Mittal;Felix C. Freiling;S. Venkatesan;Lucia Draque Penso

  • Affiliations:
  • Department of Computer Science, The University of Texas at Dallas, Richardson, TX;Department of Computer Science, RWTH Aachen University, Aachen, Germany;Department of Computer Science, The University of Texas at Dallas, Richardson, TX;Department of Computer Science, RWTH Aachen University, Aachen, Germany

  • Venue:
  • DISC'05 Proceedings of the 19th international conference on Distributed Computing
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

We investigate the problem of detecting termination of a distributed computation in systems where processes can fail by crashing. Specifically, when the communication topology is fully connected, we describe a way to transform any termination detection algorithm $\mathcal{A}$ that has been designed for a failure-free environment into a termination detection algorithm $\mathcal{B}$ that can tolerate process crashes. Our transformation assumes the existence of a perfect failure detector. We show that a perfect failure detector is in fact necessary to solve the termination detection problem in a crash-prone distributed system even if at most one process can crash. Let μ(n,M) and δ(n,M) denote the message complexity and detection latency, respectively, of $\mathcal{A}$ when the system has n processes and the underlying computation exchanges M application messages. The message complexity of $\mathcal{B}$ is at most O(n + μ(n,0)) messages per failure more than the message complexity of $\mathcal{A}$. Also, its detection latency is at most O(δ(n,0)) per failure more than that of $\mathcal{A}$. Furthermore, the overhead (that is, the amount of control data piggybacked) on an application message increases by only O(log n) bits per failure. The fault-tolerant termination detection algorithm resulting from the transformation satisfies two desirable properties. First, it can tolerate failure of up to n–1 processes, that is, it is wait-free. Second, it does not impose any overhead on the fault-sensitive termination detection algorithm until one or more processes crash, that is, it is fault-reactive. Our transformation can be extended to arbitrary communication topologies provided process crashes do not partition the system.