Cyclic debugging executes a program repeatedly in order to track down and eliminate bugs. During re-execution, programmers may want to stop at breakpoints or step through the code to inspect the program's state and detect errors. For long-running parallel programs, the biggest drawback is the cost of restarting the program's execution from the beginning every time. Combining checkpointing with debugging offers a solution, because a program run can then be started at any intermediate checkpoint. The remaining problem is selecting an appropriate recovery line for a given breakpoint: the temporal distance between these two points may be rather long if recovery lines are chosen only at consistent global checkpoints. The method described in this paper allows users to select an arbitrary checkpoint as the starting point for debugging and thus shortens this temporal distance. In addition, it provides a mechanism for reducing the amount of trace data (in terms of logged messages). The resulting technique reduces the waiting time and the cost of cyclic debugging.
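To make the notion of a consistent global checkpoint concrete, the following is a minimal sketch of the standard vector-clock consistency test. It is not the selection method of this paper, which the abstract does not detail. The function name `is_consistent`, the representation of a cut as a dictionary, and the assumption that every checkpoint stores the vector clock of its process are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Assumption: each local checkpoint records the vector clock of its process
# at the moment the checkpoint was taken. Entry k counts the events of
# process k that are known at that point.
VectorClock = List[int]


def is_consistent(cut: Dict[int, VectorClock]) -> bool:
    """Return True if the chosen local checkpoints form a consistent cut.

    `cut` maps a process id to the vector clock of the checkpoint selected
    for that process. The cut is inconsistent if some process j has already
    seen more events of process i than the checkpoint of i itself records,
    because j would then hold a message that i has not yet sent.
    """
    for i, clock_i in cut.items():
        for j, clock_j in cut.items():
            if i != j and clock_j[i] > clock_i[i]:
                return False
    return True


# Hypothetical two-process example.
consistent_cut = {0: [2, 0], 1: [1, 3]}  # process 1 knows 1 event of process 0, which has recorded 2
orphan_cut = {0: [1, 0], 1: [2, 3]}      # process 1 knows 2 events of process 0, which has recorded only 1

print(is_consistent(consistent_cut))
print(is_consistent(orphan_cut))
```

A debugger restricted to cuts that pass such a test may have to roll back far before the breakpoint. That is the temporal distance the abstract refers to, and it is why starting from an arbitrary checkpoint, with logged messages supplying what the cut is missing, can shorten the wait.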