ROS: the rollback-one-step method to minimize the waiting time during debugging long-running parallel programs

Authors:
Nam Thoai;Dieter Kranzlmüller;Jens Volkert
Affiliations:
GUP Linz, Johannes Kepler University Linz, Linz, Austria, Europe;GUP Linz, Johannes Kepler University Linz, Linz, Austria, Europe;GUP Linz, Johannes Kepler University Linz, Linz, Austria, Europe
Venue:
VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
Year:
2002

Abstract

Cyclic debugging is used to execute programs over and over again for tracking down and eliminating bugs. During re-execution, programmers may want to stop at breakpoints or apply step-by-step execution for inspecting the program's state and detecting errors. For long-running parallel programs, the biggest drawback is the cost associated with restarting the program's execution every time from the beginning. A solution is offered by combining checkpointing and debugging, which allows a program run to be initiated at any intermediate checkpoint. A problem is the selection of an appropriate recovery line for a given breakpoint. The temporal distance between these two points may be rather long if recovery lines are only chosen at consistent global checkpoints. The method described in this paper allows users to select an arbitrary checkpoint as a starting point for debugging and thus to shorten the temporal distance. In addition, a mechanism for reducing the amount of trace data (in terms of logged messages) is provided. The resulting technique is able to reduce the waiting time and the costs of cyclic debugging.
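
As background for the notion of a consistent global checkpoint mentioned above, the following is a minimal sketch of the standard vector-clock consistency test. It is not the ROS method from the paper. The function name `is_consistent` and the data layout are hypothetical, and the sketch assumes that every event on a process, including taking a checkpoint, increments that process's own vector-clock component.

```python
from typing import Dict, List

VectorClock = List[int]


def is_consistent(cut: Dict[int, VectorClock]) -> bool:
    """Return True if the chosen local checkpoints form a consistent cut.

    `cut` maps each process id to the vector clock recorded at the
    checkpoint selected for that process (hypothetical layout). The cut
    is expected to contain one checkpoint for every process.

    The cut is inconsistent if some process j already knows about an
    event on process i that happened after i's chosen checkpoint, i.e.
    a message sent after checkpoint i was received before checkpoint j.
    """
    for i, vc_i in cut.items():
        for j, vc_j in cut.items():
            if i != j and vc_j[i] > vc_i[i]:
                return False
    return True


# Process 1 has seen only event 1 of process 0, which precedes
# process 0's checkpoint (taken at its event 2).
consistent_cut = {0: [2, 0], 1: [1, 2]}

# Process 1 has seen event 3 of process 0, which lies after
# process 0's checkpoint: the message that carried it is an orphan.
inconsistent_cut = {0: [2, 0], 1: [3, 2]}

print(is_consistent(consistent_cut))
print(is_consistent(inconsistent_cut))
```

When recovery lines are limited to cuts that pass such a test, the nearest usable one may lie far before the breakpoint, which is the temporal distance the abstract refers to.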