Memory performance estimation of CUDA programs

Authors:
Yooseong Kim;Aviral Shrivastava
Affiliations:
Arizona State University, Tempe, AZ;Arizona State University, Tempe, AZ
Venue:
ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Year:
2013

Citing 12
Cited 0

Data Reuse Analysis Technique for Software-Controlled Memory Hierarchies

Proceedings of the conference on Design, automation and test in Europe - Volume 1
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
CUDA-Lite: Reducing GPU Programming Complexity

Languages and Compilers for Parallel Computing
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
A GPGPU compiler for memory optimization and parallelism management

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
An integrated GPU power and performance model

Proceedings of the 37th annual international symposium on Computer architecture
A quantitative performance analysis model for GPU architectures

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Automatic C-to-CUDA code generation for affine programs

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Elimination of redundant memory traffic in high-level synthesis

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

CUDA has successfully popularized GPU computing, and GPGPU applications are now used in various embedded systems. The CUDA programming model provides a simple interface to program on GPUs, but tuning GPGPU applications for high performance is still quite challenging. Programmers need to consider numerous architectural details, and small changes in source code, especially on the memory access pattern, can affect performance significantly. This makes it very difficult to optimize CUDA programs. This article presents CuMAPz, which is a tool to analyze and compare the memory performance of CUDA programs. CuMAPz can help programmers explore different ways of using shared and global memories, and optimize their program for efficient memory behavior. CuMAPz models several memory-performance-related factors: data reuse, global memory access coalescing, global memory latency hiding, shared memory bank conflict, channel skew, and branch divergence. Experimental results show that CuMAPz can accurately estimate performance with correlation coefficient of 0.96. By using CuMAPz to explore the memory access design space, we could improve the performance of our benchmarks by 30% more than the previous approach [Hong and Kim 2010].