MPI runtime error detection with MUST: advances in deadlock detection
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The widely used Message Passing Interface (MPI) with its multitude of communication functions is prone to usage errors. Runtime error detection tools aid in the removal of these errors. We develop MUST as one such tool that provides a wide variety of automatic correctness checks. Its correctness checks can be run in a distributed mode, except for its deadlock detection. This limitation applies to a wide range of tools that either use centralized detection algorithms or a timeout approach. In order to provide scalable and distributed deadlock detection with detailed insight into deadlock situations, we propose a model for MPI blocking conditions that we use to formulate a distributed algorithm. This algorithm implements scalable MPI deadlock detection in MUST. Stress tests at up to 4,096 processes demonstrate the scalability of our approach. Finally, overhead results for a complex benchmark suite demonstrate an average runtime increase of 34% at 2,048 processes.