GPURoofline: a model for guiding performance optimizations on GPUs

Authors:
Haipeng Jia;Yunquan Zhang;Guoping Long;Jianliang Xu;Shengen Yan;Yan Li
Affiliations:
Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, China, College of Information Science and Engineering, The Ocean University of China, China;Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, China, State Key Laboratory of Computing Science, The Chinese Academy of Sciences, China;Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, China;College of Information Science and Engineering, The Ocean University of China, China;Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, China, State Key Laboratory of Computing Science, The Chinese Academy of Sciences, China, G ...;Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, China, State Key Laboratory of Computing Science, The Chinese Academy of Sciences, China, G ...
Venue:
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Year:
2012

Abstract

Performance optimization on GPUs requires deep technical knowledge of the underlying hardware. Modern GPU architectures are increasingly diverse, which makes this already difficult problem harder. This paper presents GPURoofline, an empirical model for guiding optimizations on GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architectures implement high-performance GPU kernels. The model does so by exposing potential performance bottlenecks and evaluating whether a specific optimization technique actually improves performance. To demonstrate the usage of the model, we optimize four representative kernels with different computational densities, namely matrix transpose, Laplace transform, integral, and face detection, on both NVIDIA and AMD GPUs. Experimental results show that, under the guidance of GPURoofline, these kernels achieve speedups of 3.74× to 14.8× over their naïve implementations on both platforms.
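
For readers unfamiliar with roofline-style reasoning, the following is a minimal sketch of the classic Roofline bound that GPURoofline's name refers to: attainable throughput is the smaller of the compute roof and the memory roof (bandwidth times operational intensity). It is not the GPURoofline model itself, whose empirical details are not given in this abstract. The function names and the hardware numbers are hypothetical and chosen only for illustration.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, intensity_flops_per_byte):
    """Classic roofline bound: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity_flops_per_byte)


def likely_bottleneck(peak_gflops, peak_bandwidth_gbs, intensity_flops_per_byte):
    """Classify a kernel as memory- or compute-bound relative to the ridge point."""
    ridge_point = peak_gflops / peak_bandwidth_gbs  # FLOPs per byte where the roofs meet
    return "memory" if intensity_flops_per_byte < ridge_point else "compute"


# Hypothetical device: 1000 GFLOP/s peak compute, 100 GB/s peak memory bandwidth.
PEAK_GFLOPS = 1000.0
PEAK_BANDWIDTH_GBS = 100.0

# Hypothetical kernels: a low-intensity one (e.g. a copy-like kernel)
# and a high-intensity one (e.g. a compute-heavy filter).
low_intensity = 0.25   # FLOPs per byte, assumed
high_intensity = 20.0  # FLOPs per byte, assumed

low_bound = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, low_intensity)
high_bound = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, high_intensity)
low_kind = likely_bottleneck(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, low_intensity)
high_kind = likely_bottleneck(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, high_intensity)
```

Under these assumed numbers, the low-intensity kernel sits under the memory roof, so memory-access optimizations would be the ones worth trying first, while the high-intensity kernel is limited by the compute roof.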