TAUoverSupermon: low-overhead online parallel performance monitoring

  • Authors:
  • Aroon Nataraj;Matthew Sottile;Alan Morris;Allen D. Malony;Sameer Shende

  • Affiliations:
  • Department of Computer and Information Science, University of Oregon, Eugene, OR;Los Alamos National Laboratory, Los Alamos, NM;Department of Computer and Information Science, University of Oregon, Eugene, OR;Department of Computer and Information Science, University of Oregon, Eugene, OR;Department of Computer and Information Science, University of Oregon, Eugene, OR

  • Venue:
  • Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
  • Year:
  • 2007

Abstract

Online application performance monitoring allows performance characteristics to be tracked during execution rather than post-mortem. This opens up possibilities that are otherwise unavailable, such as real-time visualization and application performance steering, which can be useful for long-running applications. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. Two fundamental components of such a performance monitor are the measurement system and the transport system. We adapt and combine two existing, mature systems - TAU and Supermon - to address this problem. TAU performs the measurement, while Supermon is used to collect the distributed measurement state. Our experiments show that this novel approach leads to very low-overhead application monitoring, as well as other benefits unavailable when using a transport such as NFS.