Error detection in large-scale parallel programs with long runtimes

  • Authors:
  • Dieter Kranzlmüller;Nam Thoai;Jens Volkert

  • Affiliations:
  • GUP Linz, Johannes Kepler University Linz, Altenbergerstr. 69, A-4040 Linz, Austria;GUP Linz, Johannes Kepler University Linz, Altenbergerstr. 69, A-4040 Linz, Austria;GUP Linz, Johannes Kepler University Linz, Altenbergerstr. 69, A-4040 Linz, Austria

  • Venue:
  • Future Generation Computer Systems - Tools for program development and analysis
  • Year:
  • 2003

Abstract

Error detection is an important part of program development; its purpose is to uncover incorrect computations and runtime failures in software. The cost of debugging is strongly related to the complexity and scale of the program under investigation. Both characteristics are especially troublesome for large-scale parallel programs with long runtimes, which are common in computational science and engineering (CSE) applications. A solution is offered by a combination of techniques that use the event graph model to represent parallel program behaviour. With process isolation, a subset of the original processes can be investigated, while the absent processes are simulated by the debugging system. With checkpointing, an arbitrary temporal section of a program's runtime can be extracted for exhaustive analysis without restarting the program from the beginning. Additional benefits of the event graph are support for equivalent execution of nondeterministic programs and a comprehensible visualisation as a space-time diagram.
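As an illustration of process isolation on an event graph, the following is a minimal sketch and not the authors' implementation. It assumes a trace stored as pairs of send and receive events; the names `Event`, `isolate`, and `keep` are hypothetical. Messages between retained processes can be re-executed normally, while messages arriving from absent processes are marked as ones the debugging system would have to supply from the trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    proc: int   # rank of the process on which the event occurred
    seq: int    # position of the event in that process's local order
    kind: str   # "send" or "recv"


def isolate(messages, keep):
    """Classify message edges for a run restricted to the processes in `keep`.

    Returns (internal, simulated):
      internal  - both endpoints are retained, so the message is re-executed
      simulated - the sender is absent, so the debugger must replay the
                  message contents from the recorded trace
    Messages whose receiver is absent are dropped, since nothing retained
    depends on them.
    """
    internal, simulated = [], []
    for send, recv in messages:
        if recv.proc not in keep:
            continue
        if send.proc in keep:
            internal.append((send, recv))
        else:
            simulated.append((send, recv))
    return internal, simulated


# Hypothetical three-process trace: edges are (send event, receive event).
messages = [
    (Event(0, 0, "send"), Event(1, 0, "recv")),
    (Event(1, 1, "send"), Event(2, 0, "recv")),
    (Event(2, 1, "send"), Event(1, 2, "recv")),
]

# Investigate only processes 0 and 1; process 2 is absent.
internal, simulated = isolate(messages, keep={0, 1})
```

In this sketch the message from process 2 to process 1 ends up in `simulated`, which corresponds to the abstract's idea that absent processes are stood in for by the debugging system. A temporal section obtained through checkpointing could be modelled in the same way by additionally filtering events on their `seq` range, although that is also an assumption rather than something the abstract specifies.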