Robust non-intrusive record-replay with processor extraction

Authors:
Filippo Gioachin;Gengbin Zheng;Laxmikant V. Kalé
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of the 8th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging
Year:
2010

Abstract

As parallel machines grow ever larger, debugging becomes increasingly challenging. In particular, applications at this scale tend to behave non-deterministically, leading to race-condition bugs. Furthermore, gaining access to these large machines for long debugging sessions is generally infeasible. In this paper, we present a three-step algorithm to perform what we call "processor extraction": a procedure that records the execution of a set of processors from a parallel application and replays any of them in a controlled environment. Our technique generates very low interference in the recorded program thanks to the separation between non-determinism elimination and detailed processor recording. To improve robustness and accuracy, we further augment the algorithm with a self-correction mechanism.
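
The abstract does not enumerate the three steps, so the following is only a toy sketch of one plausible reading: first log message delivery order alone (cheap), then re-run with that order enforced while logging full message contents for the selected processors, and finally replay one selected processor by itself. All names here (`handle`, `record_order`, `replay_and_record`, `replay_extracted`) are hypothetical, the "processors" are plain list slots, and a shuffled list stands in for network non-determinism; none of this is taken from the authors' implementation.

```python
import random

def handle(state, payload):
    # Order-sensitive update, so a different delivery order gives a different state.
    return (state * 31 + payload) % 1_000_003

def record_order(messages, num_procs, seed):
    """Assumed step 1: run with arbitrary delivery order, logging only message ids."""
    rng = random.Random(seed)
    arrivals = list(messages)
    rng.shuffle(arrivals)  # stands in for network non-determinism
    order = {p: [] for p in range(num_procs)}
    states = [0] * num_procs
    for msg_id, dest, payload in arrivals:
        order[dest].append(msg_id)
        states[dest] = handle(states[dest], payload)
    return order, states

def replay_and_record(messages, num_procs, order, selected):
    """Assumed step 2: enforce the logged order; log payloads for selected processors."""
    by_id = {msg_id: (dest, payload) for msg_id, dest, payload in messages}
    states = [0] * num_procs
    detail = {p: [] for p in selected}
    for p in range(num_procs):
        for msg_id in order[p]:
            dest, payload = by_id[msg_id]
            assert dest == p, "order log and message set disagree"
            if p in detail:
                detail[p].append(payload)  # detailed recording, selected processors only
            states[p] = handle(states[p], payload)
    return detail, states

def replay_extracted(detail_log):
    """Assumed step 3: replay a single processor alone from its detailed log."""
    state = 0
    for payload in detail_log:
        state = handle(state, payload)
    return state

# Hypothetical workload: 20 messages spread over 4 processors.
messages = [(i, i % 4, i * 7 + 1) for i in range(20)]
order, original_states = record_order(messages, 4, seed=42)
detail, replayed_states = replay_and_record(messages, 4, order, selected={2})
extracted_state = replay_extracted(detail[2])
```

In this sketch the first run pays only for a list of message ids per processor, which is the sense in which separating non-determinism elimination from detailed recording keeps interference low; the heavier payload logging happens in a run whose ordering is already fixed. The messages here are independent of one another, which a real application's messages generally are not.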