Debugging Large-Scale, Long-Running Parallel Programs

  • Authors:
  • Dieter Kranzlmüller; Nam Thoai; Jens Volkert

  • Venue:
  • ICCS '02 Proceedings of the International Conference on Computational Science-Part II
  • Year:
  • 2002

Abstract

Cyclic debugging denotes error detection techniques in which a program is executed repeatedly to identify the original cause of incorrect runtime behavior. For large-scale, long-running parallel programs, this repetition is especially problematic because of the time, processing resources, and computing costs it requires. A solution is offered by a combination of techniques that use the event graph model as the main representation of parallel program behavior. On the one hand, process isolation reduces the number of deployed processes, since only a subset of the original processes is executed during debugging. On the other hand, an integrated checkpointing mechanism makes it possible to extract limited periods of execution time or to start subsequent program executions at intermediate points. Additionally, the event graph provides equivalent program execution in the presence of nondeterminism, as well as the possibility of investigating the effects of program perturbation induced by the observation functionality.
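
As a rough illustration of process isolation on an event graph, the following Python sketch records message-passing events and determines which messages would have to be supplied from a recorded trace when only a subset of processes is re-executed. The names `Event` and `messages_to_replay`, the trace layout, and the three-process example are assumptions made for illustration; they are not taken from the paper or its tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    process: int  # rank of the process on which the event occurred
    index: int    # position of the event in that process's local order
    kind: str     # "send" or "recv" (hypothetical event kinds)


def messages_to_replay(messages, isolated):
    """Return the (send, recv) pairs whose receiver is inside the
    isolated process set but whose sender is outside it.

    During an isolated re-run the outside senders are not executed,
    so these messages must be taken from the recorded trace.
    """
    return [
        (send, recv)
        for send, recv in messages
        if recv.process in isolated and send.process not in isolated
    ]


# Hypothetical trace of three processes exchanging three messages.
s0 = Event(0, 0, "send")
r1 = Event(1, 0, "recv")
s1 = Event(1, 1, "send")
r2 = Event(2, 0, "recv")
s2 = Event(2, 1, "send")
r1b = Event(1, 2, "recv")

# Each message is an edge of the event graph from a send to its receive.
messages = [(s0, r1), (s1, r2), (s2, r1b)]

# Re-execute only process 1; its incoming messages come from the trace.
replayed = messages_to_replay(messages, isolated={1})
for send, recv in replayed:
    print(f"replay message P{send.process} -> P{recv.process}")
```

In this sketch, isolating process 1 means that the two messages it receives from processes 0 and 2 are replayed from the trace, while the message it sends to process 2 needs no counterpart during debugging.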