Shortcut Replay: A Replay Technique for Debugging Long-Running Parallel Programs

Authors:
Nam Thoai; Dieter Kranzlmüller; Jens Volkert
Venue:
ASIAN '02 Proceedings of the 7th Asian Computing Science Conference on Advances in Computing Science: Internet Computing and Modeling, Grid Computing, Peer-to-Peer Computing, and Cluster
Year:
2002

Abstract

Applications running on HPC platforms, PC clusters, or computational grids are often long-running parallel programs. Debugging these programs is challenging because efficient debugging tools are lacking and parallel programs are inherently prone to nondeterminism. To overcome nondeterminism, several sophisticated record-and-replay mechanisms have been developed. However, the waiting time during re-execution has not been sufficiently investigated. This paper shows that, with currently available methods, the waiting time is unbounded in some cases, which precludes efficient interactive debugging tools. In contrast, the new shortcut replay method combines checkpointing and debugging techniques: it controls the replayed execution based on the trace data in order to minimize the waiting time when debugging long-running parallel programs.
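
The general idea of combining checkpoints with a recorded trace can be illustrated with a small sketch. This is not the shortcut replay algorithm from the paper, whose details the abstract does not give. It is a single-process toy under stated assumptions: the function names (`step`, `record`, `replay_to`), the shuffled list standing in for nondeterministic message arrival, and the fixed checkpoint interval are all hypothetical. It only shows why restarting from a checkpoint and following the trace bounds the amount of re-execution needed to reach a given event.

```python
import random

MASK = 0xFFFFFFFF


def step(state, msg):
    """Order-sensitive update applied for each received message (hypothetical)."""
    return (state * 31 + msg) & MASK


def record(messages, checkpoint_every, seed=0):
    """Run once, log the arrival order as a trace, and checkpoint periodically."""
    rng = random.Random(seed)
    order = list(messages)
    rng.shuffle(order)  # assumption: stands in for nondeterministic arrival order
    trace = []
    checkpoints = {0: 0}  # event count -> state after that many events
    state = 0
    for i, msg in enumerate(order, start=1):
        state = step(state, msg)
        trace.append(msg)
        if i % checkpoint_every == 0:
            checkpoints[i] = state
    return trace, checkpoints, state


def replay_to(target, trace, checkpoints):
    """Rebuild the state after `target` events from the nearest earlier checkpoint.

    Returns the state and the number of events that had to be re-executed.
    """
    start = max(i for i in checkpoints if i <= target)
    state = checkpoints[start]
    for msg in trace[start:target]:  # follow the recorded order, not a new one
        state = step(state, msg)
    return state, target - start


if __name__ == "__main__":
    trace, checkpoints, final_state = record(range(1, 1001), checkpoint_every=100)
    # Replaying from the very beginning versus from the nearest checkpoint.
    print(replay_to(950, trace, {0: 0})[1], "events re-executed without checkpoints")
    print(replay_to(950, trace, checkpoints)[1], "events re-executed with checkpoints")
```

In this toy, the re-execution needed to reach any event is at most one checkpoint interval. In a real message-passing program the situation is harder: the per-process checkpoints used for a restart must form a consistent global state, and messages crossing the restart line have to be handled explicitly.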