Optimizing GPU kernels is challenging because the process requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming increasingly diverse, which makes performance optimization harder still. This paper presents an insightful performance tuning chain for GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architectures implement high-performance GPU kernels directly. The tuning chain provides performance information that identifies the bottlenecks of a GPU program and indicates which optimization methods to adopt, so that algorithm features are matched as closely as possible to the characteristics of the underlying hardware. To demonstrate the use of the tuning chain, we optimize three representative GPU kernels with different compute intensities (Matrix Transpose, Laplace Transform, and Integral) on both NVIDIA and AMD GPUs. Experimental results show that, under the guidance of our tuning chain, these kernels achieve speedups of 7.8x to 42.4x over their naïve implementations on both NVIDIA and AMD GPU platforms.
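To make the notion of compute intensity concrete, the following is a minimal sketch of how a kernel might be classified as memory-bound or compute-bound before choosing an optimization. It is not the tuning chain described in the paper. The `Kernel` type, the `classify` function, the workload sizes, and the device peak figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    flops: float        # floating-point operations performed (assumed count)
    bytes_moved: float  # bytes read from and written to global memory


def compute_intensity(k: Kernel) -> float:
    """Floating-point operations per byte of global-memory traffic."""
    return k.flops / k.bytes_moved


def classify(k: Kernel, peak_gflops: float, peak_gbps: float) -> str:
    """Compare a kernel's intensity with the device's flops-per-byte balance."""
    balance = peak_gflops / peak_gbps
    return "memory-bound" if compute_intensity(k) < balance else "compute-bound"


# Hypothetical workloads: an n x n float32 transpose does no arithmetic and
# moves every element twice (one read, one write); the second kernel is an
# assumed arithmetic-heavy workload with little memory traffic.
n = 4096
transpose = Kernel("transpose", flops=0.0, bytes_moved=2 * n * n * 4)
integral = Kernel("integral", flops=1e9, bytes_moved=4e6)

# Hypothetical device: 1000 GFLOP/s peak compute, 150 GB/s peak bandwidth.
for k in (transpose, integral):
    print(k.name, classify(k, peak_gflops=1000.0, peak_gbps=150.0))
```

Under these assumptions, a memory-bound kernel such as the transpose would be a candidate for memory-access optimizations, while a compute-bound kernel would be a candidate for arithmetic-level optimizations.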