ScalaTrace: Scalable compression and replay of communication traces for high-performance computing

  • Authors:
  • Michael Noeth; Prasun Ratn; Frank Mueller; Martin Schulz; Bronis R. de Supinski

  • Affiliations:
  • North Carolina State University, Department of Computer Science, Raleigh, NC 27695-7534, United States; North Carolina State University, Department of Computer Science, Raleigh, NC 27695-7534, United States; North Carolina State University, Department of Computer Science, Raleigh, NC 27695-7534, United States; Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA 94551, United States; Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, Livermore, CA 94551, United States

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2009

Abstract

Characterizing the communication behavior of large-scale applications is a difficult and costly task due to code and system complexity and long execution times. While many tools have been developed to study this behavior, they either aggregate information in a lossy way through high-level statistics or produce huge trace files that are hard to handle. We contribute an approach that produces communication traces that are orders of magnitude smaller, if not near-constant in size, regardless of the number of nodes, while preserving structural information. We introduce intra- and inter-node compression techniques for MPI events that are capable of extracting an application's communication structure. We further present a replay mechanism for the traces generated by our approach and discuss results of our implementation for BlueGene/L. Given this novel capability, we discuss its impact on communication tuning and beyond. To the best of our knowledge, such a concise and scalable representation of MPI traces, combined with deterministic MPI call replay, is without precedent.
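
The abstract does not spell out the compression algorithm, so the following is only an illustrative sketch of the general idea behind intra-node compression: a repeating sequence of MPI events recorded on one node can be folded into a single (pattern, repeat count) record, and replay can expand it again. The function names `compress_events` and `expand_events`, the `max_window` parameter, and the greedy matching strategy are assumptions made for this example and are not taken from the paper.

```python
def compress_events(events, max_window=8):
    """Greedily fold immediately repeating subsequences into (pattern, count) pairs.

    Hypothetical helper for illustration only; not ScalaTrace's actual algorithm.
    """
    out = []
    i = 0
    n = len(events)
    while i < n:
        best_w, best_reps = 1, 1
        # Try each candidate pattern length and keep the one covering the most events.
        for w in range(1, max_window + 1):
            if i + 2 * w > n:
                break  # not enough events left for even one repetition
            pattern = events[i:i + w]
            reps = 1
            while events[i + reps * w:i + (reps + 1) * w] == pattern:
                reps += 1
            if reps > 1 and reps * w > best_reps * best_w:
                best_w, best_reps = w, reps
        out.append((tuple(events[i:i + best_w]), best_reps))
        i += best_w * best_reps
    return out


def expand_events(compressed):
    """Rebuild the original event list, as a replay step would need to."""
    out = []
    for pattern, count in compressed:
        out.extend(pattern * count)
    return out


# A loop of 1000 send/receive pairs followed by a barrier collapses to two records.
trace = ["MPI_Send", "MPI_Recv"] * 1000 + ["MPI_Barrier"]
compressed = compress_events(trace)
```

In this toy setting the compressed size depends on the number of distinct patterns rather than on the loop trip count, which is the kind of property that makes near-constant trace sizes plausible for regular communication structures.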