Applications running on HPC platforms, PC clusters, or computational grids are often long-running parallel programs. Debugging these programs is challenging because efficient debugging tools are lacking and because parallel programs are inherently prone to nondeterminism. To overcome the problem of nondeterminism, several sophisticated record-and-replay mechanisms have been developed. However, the substantial problem of the waiting time during re-execution has not been sufficiently investigated. This paper shows that, with currently available methods, the waiting time is in some cases unbounded, which rules out efficient interactive debugging tools. In contrast, the new shortcut replay method combines checkpointing and debugging techniques. It controls the replayed execution based on the trace data in order to minimize the waiting time when debugging long-running parallel programs.
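The general idea of combining checkpoints with a recorded trace can be illustrated with a minimal single-process sketch. It is not the shortcut replay algorithm itself, which must deal with consistent checkpoints across many communicating processes; it only shows why restarting from a saved state bounds the amount of re-execution. All names here (`Process`, `Recording`, `record`, `replay_to`, `checkpoint_interval`) are hypothetical and chosen for illustration.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process: its state depends on the order of received messages."""
    state: int = 0
    steps: int = 0

    def receive(self, value: int) -> None:
        # Order-sensitive update, standing in for nondeterministic receives.
        self.state = self.state * 31 + value
        self.steps += 1


@dataclass
class Recording:
    trace: list = field(default_factory=list)        # messages in arrival order
    checkpoints: dict = field(default_factory=dict)  # step index -> saved state


def record(messages, checkpoint_interval):
    """Run once, logging the arrival order and saving periodic checkpoints."""
    proc, rec = Process(), Recording()
    rec.checkpoints[0] = proc.state  # initial state is always a checkpoint
    for value in messages:
        proc.receive(value)
        rec.trace.append(value)
        if proc.steps % checkpoint_interval == 0:
            rec.checkpoints[proc.steps] = proc.state
    return rec


def replay_to(rec, target_step):
    """Reach target_step by replaying from the nearest earlier checkpoint.

    Returns the reconstructed process and the number of replayed steps.
    """
    saved = sorted(rec.checkpoints)
    start = saved[bisect.bisect_right(saved, target_step) - 1]
    proc = Process(state=rec.checkpoints[start], steps=start)
    for value in rec.trace[start:target_step]:
        proc.receive(value)
    return proc, target_step - start


if __name__ == "__main__":
    rec = record(list(range(1, 101)), checkpoint_interval=10)
    proc, replayed = replay_to(rec, target_step=57)
    print(proc.steps, replayed)
```

In this toy setting the number of replayed steps never exceeds `checkpoint_interval`, whereas replay from the program start grows with the position of the target step. In a real message-passing program the checkpoints of different processes would additionally have to form a consistent global state, which this sketch does not address.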