Adoption protocols for fanout-optimal fault-tolerant termination detection

Authors:
Jonathan Lifflander;Phil Miller;Laxmikant Kale
Affiliations:
University of Illinois Urbana-Champaign, Urbana, IL, USA;University of Illinois Urbana-Champaign, Urbana, IL, USA;University of Illinois Urbana-Champaign, Urbana, IL, USA
Venue:
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2013

Abstract

Termination detection signals the completion of an operation (all processors are idle and no messages are in flight) in many distributed-system settings, including work stealing algorithms, dynamic data exchange, and dynamically structured computations. As supercomputers grow, each job becomes more likely to encounter faults, so high-performance computing applications that rely on termination detection need an algorithm that tolerates them. We provide a trio of new practical fault tolerance schemes for a standard approach to termination detection. They are easy to implement, present low overhead in both theory and practice, and have scalable costs when recovering from faults. These schemes tolerate all single-process faults and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities that we calculate for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.
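
To make the completion condition in the abstract concrete (all processors idle, no messages in flight), the following is a minimal Python sketch of a counting-based check. It is an illustration under stated assumptions, not the paper's algorithm and not its fault tolerance schemes: the names `ProcessState`, `wave_totals`, and `terminated` are hypothetical, each process is assumed to keep counts of messages sent and received, and the totals are assumed to be gathered in successive "waves" (for example, by a reduction over a spanning tree). Requiring two consecutive waves to agree is one common way to avoid being misled by messages that were still in flight while the counts were collected.

```python
from dataclasses import dataclass


@dataclass
class ProcessState:
    """Hypothetical per-process bookkeeping (an assumption for illustration)."""
    idle: bool      # True when the process has no local work
    sent: int       # total application messages sent so far
    received: int   # total application messages received so far


def wave_totals(states):
    """Sum the per-process counters, as one reduction wave would."""
    total_sent = sum(s.sent for s in states)
    total_received = sum(s.received for s in states)
    return (total_sent, total_received)


def terminated(prev_totals, curr_totals, states):
    """Report termination only if every process is idle, the global
    sent and received counts match (nothing in flight), and the totals
    did not change between two consecutive waves."""
    sent, received = curr_totals
    all_idle = all(s.idle for s in states)
    return all_idle and sent == received and prev_totals == curr_totals


# Usage: two idle processes whose counters balance across two waves.
states = [ProcessState(True, 3, 2), ProcessState(True, 1, 2)]
first = wave_totals(states)
second = wave_totals(states)
done = terminated(first, second, states)
```

In this sketch a single process failure would lose its counters and stall or corrupt the totals, which is the kind of gap that fault tolerance schemes for termination detection are meant to close.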