Debugging Parallel Programs with Instant Replay
IEEE Transactions on Computers
A unified geometric approach to graph separators
SFCS '91 Proceedings of the 32nd annual symposium on Foundations of computer science
Multilevel k-way partitioning scheme for irregular graphs
Journal of Parallel and Distributed Computing
MagPIe: MPI's collective communication operations for clustered wide area systems
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Dynamic software testing of MPI applications with umpire
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Net-dbx: A Web-Based Debugger of MPI Programs Over Low-Bandwidth Lines
IEEE Transactions on Parallel and Distributed Systems
CANPC '98 Proceedings of the Second International Workshop on Network-Based Parallel Computing: Communication, Architecture, and Applications
An Implementation of Race Detection and Deterministic Replay with MPI
Euro-Par '95 Proceedings of the First International Euro-Par Conference on Parallel Processing
Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
Modeling wildcard-free MPI programs for verification
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Performance Evaluation of View-Oriented Parallel Programming
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Automated, scalable debugging of MPI programs with Intel® Message Checker
Proceedings of the second international workshop on Software engineering for high performance computing system applications
Using model checking with symbolic execution to verify parallel numerical programs
Proceedings of the 2006 international symposium on Software testing and analysis
Concurrent deadlock detection in parallel programs
International Journal of Computers and Applications
Novel techniques for debugging and optimizing parallel applications
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Model checking nonblocking MPI programs
VMCAI'07 Proceedings of the 8th international conference on Verification, model checking, and abstract interpretation
R2: an application-level kernel for record and replay
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Collective error detection for MPI collective operations
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Retrospect: deterministic replay of MPI applications for interactive distributed debugging
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Correctness debugging of message passing programs using model verification techniques
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Practical model-checking method for verifying correctness of MPI programs
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Bringing Reverse Debugging to HPC
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
FACT: fast communication trace collection for parallel applications through program slicing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
LReplay: a pending period based deterministic replay scheme
Proceedings of the 37th annual international symposium on Computer architecture
R2: an application-level kernel for record and replay
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Robust non-intrusive record-replay with processor extraction
Proceedings of the 8th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging
A Scalable and Distributed Dynamic Formal Verifier for MPI Programs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Toward generating reducible replay logs
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Deterministic replay for MCAPI programs
Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging
Exploiting parallelism in deterministic shared memory multiprocessing
Journal of Parallel and Distributed Computing
Deterministic replay for message-passing-based concurrent programs
ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special section on verification challenges in the concurrent world
Elastic and scalable tracing and accurate replay of non-deterministic events
Proceedings of the 27th international ACM conference on International conference on supercomputing
Hi-index | 0.00 |
Message Passing Interface (MPI) is a widely used standard for managing coarse-grained concurrency on distributed computers. Debugging parallel MPI applications, however, has always been a particularly challenging task due to their high degree of concurrent execution and non-deterministic behavior. Deterministic replay is a potentially powerful technique for addressing these challenges, with existing MPI replay tools adopting either data-replay or order-replay approaches. Unfortunately, each approach has its tradeoffs. Data-replay generates substantial log sizes by recording every communication message. Order-replay generates small logs, but requires all processes to be replayed together. We believe that these drawbacks are the primary reasons that inhibit the wide adoption of deterministic replay as the critical enabler of cyclic debugging of MPI applications. This paper describes subgroup reproducible replay (SRR), a hybrid deterministic replay method that provides the benefits of both data-replay and order-replay while balancing their trade-offs. SRR divides all processes into disjoint groups. It records the contents of messages crossing group boundaries as in data-replay, but records just message orderings for communication within a group as in order-replay. In this way, SRR can exploit the communication locality of traffic patterns in MPI applications. During replay, developers can then replay each group individually. SRR reduces recording overhead by not recording intra-group communication, and reduces replay overhead by limiting the size of each replay group. Exposing these tradeoffs gives the user the necessary control for making deterministic replay practical for MPI applications. We have implemented a prototype, MPIWiz, to demonstrate and evaluate SRR. MPIWiz employs a replay framework that allows transparent binary instrumentation of both library and system calls. As a result, MPIWiz replays MPI applications with no source code modification and relinking, and handles non-determinism in both MPI and OS system calls. Our preliminary results show that MPIWiz can reduce recording overhead by over a factor of four relative to data-replay, yet without requiring the entire application to be replayed as in order-replay. Recording increases execution time by 27% while the application can be replayed in just 53% of its base execution time.