DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Hi-index | 0.00 |
In this work we describe the implementation of a practical mechanism for collecting and displaying trace information in a debugger for message passing programs. We introduce a trace format that is highly compressible while still providing information adequate for debugging purposes. We make the mechanism convenient for users to access by incorporating the trace collection in a set of wrappers for the MPI communication library. We implement several debugger operations that use the trace display: consistent stoplines, undo, and rollback. They all are implemented using controlled replay, which executes at full speed in target processes until the appropriate position in the computation is reached. They provide convenient mechanisms for getting to places in the execution where the full power of a state-based debugger can be brought to bear on isolating communication errors.