ROS: the rollback-one-step method to minimize the waiting time during debugging long-running parallel programs

  • Authors:
  • Nam Thoai; Dieter Kranzlmüller; Jens Volkert

  • Affiliations:
  • GUP Linz, Johannes Kepler University Linz, Linz, Austria, Europe;GUP Linz, Johannes Kepler University Linz, Linz, Austria, Europe;GUP Linz, Johannes Kepler University Linz, Linz, Austria, Europe

  • Venue:
  • VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
  • Year:
  • 2002

Abstract

Cyclic debugging executes a program over and over again in order to track down and eliminate bugs. During re-execution, programmers may want to stop at breakpoints or apply step-by-step execution to inspect the program's state and detect errors. For long-running parallel programs, the biggest drawback is the cost of restarting the program's execution from the beginning every time. A solution is offered by combining checkpointing and debugging, which allows a program run to be initiated at any intermediate checkpoint. One problem is the selection of an appropriate recovery line for a given breakpoint. The temporal distance between these two points may be rather long if recovery lines are only chosen at consistent global checkpoints. The method described in this paper allows users to select an arbitrary checkpoint as a starting point for debugging and thus to shorten the temporal distance. In addition, a mechanism for reducing the amount of trace data (in terms of logged messages) is provided. The resulting technique is able to reduce the waiting time and the costs of cyclic debugging.
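The gap between a breakpoint and a consistent recovery line can be illustrated with a small sketch. This is not the ROS method from the paper; it is a generic rollback-propagation computation under simplifying assumptions: all processes share one global clock, every process has an initial checkpoint at time 0, and a message is treated as an orphan when it is received before the receiver's checkpoint but sent after the sender's checkpoint. The names `latest_checkpoint` and `consistent_recovery_line`, and all sample data, are hypothetical. Message logging is not modeled.

```python
from bisect import bisect_left, bisect_right


def latest_checkpoint(times, t, inclusive=True):
    """Return the latest checkpoint time <= t (or < t if not inclusive).

    Assumes `times` is sorted and starts with an initial checkpoint at 0,
    and that t > 0 when inclusive is False.
    """
    idx = bisect_right(times, t) if inclusive else bisect_left(times, t)
    return times[idx - 1]


def consistent_recovery_line(checkpoints, messages, t):
    """Latest consistent global checkpoint at or before time t.

    checkpoints: {process: sorted list of checkpoint times}
    messages:    list of (src, send_time, dst, recv_time)
    """
    # Start from the latest local checkpoint of every process.
    line = {p: latest_checkpoint(ts, t) for p, ts in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for src, sent, dst, received in messages:
            # Orphan message: the receipt is recorded in dst's checkpoint,
            # but the send is not recorded in src's checkpoint.
            if received <= line[dst] and sent > line[src]:
                # Roll the receiver back to before the receipt.
                line[dst] = latest_checkpoint(
                    checkpoints[dst], received, inclusive=False
                )
                changed = True
    return line


# Hypothetical two-process run with a breakpoint on process 1 at time 100.
checkpoints = {0: [0, 10, 50], 1: [0, 40, 90]}
messages = [(0, 60, 1, 70), (1, 45, 0, 48), (0, 20, 1, 30)]
breakpoint_time = 100

line = consistent_recovery_line(checkpoints, messages, breakpoint_time)
local = latest_checkpoint(checkpoints[1], breakpoint_time)

print("consistent recovery line:", line)
print("replay distance from consistent line:", breakpoint_time - line[1])
print("replay distance from latest local checkpoint:", breakpoint_time - local)
```

In this made-up run the orphan messages push process 1 all the way back to its initial checkpoint, while its latest local checkpoint lies much closer to the breakpoint. That difference is the temporal distance the abstract refers to.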