Termination detection signals the completion of an operation in a distributed system, meaning that all processors are idle and no messages are in flight. Many operations rely on it, including work stealing algorithms, dynamic data exchange, and dynamically structured computations. As supercomputers grow and the likelihood that a job encounters a fault increases, high-performance computing applications that rely on termination detection need an algorithm that tolerates these faults. We provide a trio of new, practical fault tolerance schemes for a standard approach to termination detection. They are easy to implement, have low overhead in both theory and practice, and have scalable costs when recovering from faults. These schemes tolerate all single-process faults and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities calculated for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.