OCM—a monitoring system for interoperable tools
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
An object-based infrastructure for program monitoring and steering
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
The Autopilot performance-directed adaptive control system
Future Generation Computer Systems - I. High Performance Numerical Methods and Applications. II. Performance Data Mining: Automated Diagnosis, Adaption, and Optimization
Active harmony: towards automated performance tuning
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Supermon: A High-Speed Cluster Monitoring System
CLUSTER '02 Proceedings of the IEEE International Conference on Cluster Computing
Falcon: on-line monitoring and steering of large-scale parallel programs
FRONTIERS '95 Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95)
Uintah: A Massively Parallel Problem Solving Environment
HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing
MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Monitoring Large Systems Via Statistical Sampling
International Journal of High Performance Computing Applications
On-line automated performance diagnosis on thousands of processes
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
The Tau Parallel Performance System
International Journal of High Performance Computing Applications
TAUg: runtime global performance data access using MPI
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Lessons learned at 208K: towards debugging millions of cores
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Observing Performance Dynamics Using Parallel Profile Snapshots
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
TAUmon: scalable online performance data analysis in TAU
Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
Hi-index | 0.00 |
Online application performance monitoring allows tracking performance characteristics during execution as opposed to doing so post-mortem. This opens up several possibilities otherwise unavailable such as real-time visualization and application performance steering that can be useful in the context of long-running applications. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low overhead while still providing a useful performance reporting capability. Two fundamental components that constitute such a performance monitor are the measurement and transport systems. We adapt and combine two existing, mature systems - TAU and Supermon - to address this problem. TAU performs the measurement while Supermon is used to collect the distributed measurement state. Our experiments show that this novel approach leads to very lowoverhead application monitoring as well as other benefits unavailable from using a transport such as NFS.