Effective sampling-driven performance tools for GPU-accelerated supercomputers

Authors:
Milind Chabbi; Karthik Murthy; Michael Fagan; John Mellor-Crummey
Affiliations:
Rice University, Houston, TX; Rice University, Houston, TX; Rice University, Houston, TX; Rice University, Houston, TX
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

Performance analysis of GPU-accelerated systems requires a system-wide view that considers both CPU and GPU components. In this paper, we describe how to extend system-wide, sampling-based performance analysis methods to GPU-accelerated systems. Because current GPUs do not support sampling, our implementation required careful coordination of instrumentation-based performance data collection on GPUs with the sampling-based methods employed on CPUs. We also introduce a novel technique for analyzing systemic idleness in CPU/GPU systems. We demonstrate the effectiveness of our techniques with application case studies on Titan and Keeneland. Highlights of our case studies include: 1) we improved the performance of LULESH 1.0 by 30%, 2) we identified a hardware performance problem on Keeneland, 3) we identified a scaling problem in LAMMPS that stems from CUDA initialization, and 4) we identified a performance problem caused by GPU synchronization operations that are delayed by blocking system calls.