Error detection in large-scale parallel programs with long runtimes

Authors:
Dieter Kranzlmüller;Nam Thoai;Jens Volkert
Affiliations:
GUP Linz, Johannes Kepler University Linz, Altenbergerstr. 69, A-4040 Linz, Austria (all authors)
Venue:
Future Generation Computer Systems - Tools for program development and analysis
Year:
2003

Abstract

Error detection is an essential activity in program development; its purpose is to find incorrect computations and runtime failures in software. The cost of debugging depends strongly on the complexity and scale of the program under investigation. Both factors are especially burdensome for large-scale parallel programs with long runtimes, which are common in computational science and engineering (CSE) applications. A solution is offered by a combination of techniques that use the event graph model to represent parallel program behaviour. With process isolation, a subset of the original processes can be investigated while the absent processes are simulated by the debugging system. With checkpointing, an arbitrary temporal section of a program's runtime can be extracted for exhaustive analysis without restarting the program from the beginning. The event graph additionally supports equivalent execution of nondeterministic programs and provides a comprehensible visualisation as a space-time diagram.
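
To make the event graph idea more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simple message-passing model in which each event is identified by a (process, index) pair. All names (`build_event_graph`, `lamport_clocks`, `boundary_messages`) are hypothetical. The sketch builds the graph from per-process event counts and send/receive pairs, assigns logical timestamps that respect the ordering (the horizontal axis of a space-time diagram), and lists the messages that enter an isolated group of processes, which a debugging system would presumably have to supply in place of the absent processes.

```python
from collections import defaultdict, deque


def build_event_graph(traces, messages):
    """Return successor lists for an event graph (illustrative model only).

    traces:   {process_id: number_of_events} (assumed input format)
    messages: list of ((sender, idx), (receiver, idx)) send/receive pairs
    """
    succ = defaultdict(list)
    for proc, count in traces.items():
        for i in range(count - 1):
            succ[(proc, i)].append((proc, i + 1))  # order within one process
    for send, recv in messages:
        succ[send].append(recv)  # a message orders its send before its receive
    return succ


def lamport_clocks(traces, succ):
    """Assign logical timestamps by walking the graph in topological order."""
    events = [(p, i) for p, n in traces.items() for i in range(n)]
    indegree = {e: 0 for e in events}
    for e in events:
        for s in succ[e]:
            indegree[s] += 1
    clock = {e: 1 for e in events}
    ready = deque(e for e in events if indegree[e] == 0)
    while ready:
        e = ready.popleft()
        for s in succ[e]:
            clock[s] = max(clock[s], clock[e] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return clock


def boundary_messages(messages, isolated):
    """Messages sent by absent processes and received inside the isolated group."""
    return [(s, r) for s, r in messages
            if s[0] not in isolated and r[0] in isolated]


# Three processes; the event counts and messages are made up for illustration.
traces = {0: 3, 1: 3, 2: 2}
messages = [((0, 0), (1, 1)), ((1, 2), (2, 1)), ((2, 0), (0, 2))]
graph = build_event_graph(traces, messages)
clocks = lamport_clocks(traces, graph)
incoming = boundary_messages(messages, isolated={1, 2})
```

In this toy run, isolating processes 1 and 2 leaves a single incoming message, from event (0, 0) to event (1, 1), that would need to be replayed from a recorded trace. A real tool would also have to record message contents and handle nondeterministic receive order, which this sketch omits.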