Error detection is an essential part of program development: it identifies incorrect computations and runtime failures in software. The cost of debugging depends strongly on the complexity and scale of the program under investigation. Both factors are especially troublesome for large-scale parallel programs with long runtimes, which are common in computational science and engineering (CSE) applications. A solution is offered by a combination of techniques that use the event graph model to represent parallel program behaviour. With process isolation, a subset of the original processes can be investigated while the absent processes are simulated by the debugging system. With checkpointing, an arbitrary temporal section of a program's execution can be extracted for exhaustive analysis without restarting the program from the beginning. The event graph additionally supports equivalent execution of nondeterministic programs and provides a comprehensible visualisation as a space-time diagram.
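The abstract does not specify a data structure, so the following is only a minimal sketch of how an event graph might support the two uses named above. It assumes that the graph stores one event sequence per process plus an edge from each send to its matching receive, and that messages between a given pair of processes arrive in FIFO order. The names `Event`, `EventGraph`, `record`, `receive_order`, and `isolated_inputs` are hypothetical and chosen for illustration; they are not taken from the described system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    process: int            # process on which the event occurred
    index: int              # position in that process's event sequence
    kind: str               # "send" or "recv"
    peer: int               # partner process of the communication
    payload: object = None  # message content (recorded on sends only)


class EventGraph:
    """Hypothetical event graph: per-process sequences plus send->recv edges."""

    def __init__(self) -> None:
        self.events: Dict[int, List[Event]] = {}
        self.edges: List[Tuple[Event, Event]] = []
        # Unmatched sends per (source, destination) channel, assumed FIFO.
        self._pending: Dict[Tuple[int, int], List[Event]] = {}

    def record(self, process: int, kind: str, peer: int, payload=None) -> Event:
        seq = self.events.setdefault(process, [])
        ev = Event(process, len(seq), kind, peer, payload)
        seq.append(ev)
        if kind == "send":
            self._pending.setdefault((process, peer), []).append(ev)
        else:
            # Match the receive with the oldest unmatched send on this channel.
            send = self._pending[(peer, process)].pop(0)
            self.edges.append((send, ev))
        return ev

    def receive_order(self, process: int) -> List[int]:
        """Observed sender order; replaying it forces an equivalent execution."""
        return [e.peer for e in self.events.get(process, []) if e.kind == "recv"]

    def isolated_inputs(self, kept: set) -> List[Tuple[Event, Event]]:
        """Messages that kept processes received from absent processes.

        A debugger could feed these recorded messages to the kept processes
        in place of running the absent ones.
        """
        return [(s, r) for s, r in self.edges
                if r.process in kept and s.process not in kept]


# Processes 1 and 2 both send to process 0, which accepts messages from
# any sender; in this recorded run the message from process 2 arrived first.
g = EventGraph()
g.record(1, "send", 0, "a")
g.record(2, "send", 0, "b")
g.record(0, "recv", 2)
g.record(0, "recv", 1)

order = g.receive_order(0)           # sender order to enforce during replay
inputs = g.isolated_inputs({0, 1})   # messages to simulate if process 2 is absent
```

Here `order` captures the race outcome at process 0, so a replay that enforces it reproduces the same receive order, and `inputs` lists the single message that would have to be supplied on behalf of process 2 if only processes 0 and 1 were re-executed. Checkpointing is not modelled in this sketch.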