Data Reuse Analysis Technique for Software-Controlled Memory Hierarchies
Proceedings of the conference on Design, automation and test in Europe - Volume 1
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A compiler framework for optimization of affine loop nests for gpgpus
Proceedings of the 22nd annual international conference on Supercomputing
CUDA-Lite: Reducing GPU Programming Complexity
Languages and Compilers for Parallel Computing
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
An integrated GPU power and performance model
Proceedings of the 37th annual international symposium on Computer architecture
Automatic C-to-CUDA code generation for affine programs
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Elimination of redundant memory traffic in high-level synthesis
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
A performance analysis framework for identifying potential benefits in GPGPU applications
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Break down GPU execution time with an analytical method
Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools
BSArc: blacksmith streaming architecture for HPC accelerators
Proceedings of the 9th conference on Computing Frontiers
An efficient compiler framework for cache bypassing on GPUs
Proceedings of the International Conference on Computer-Aided Design
A memory access model for highly-threaded many-core architectures
Future Generation Computer Systems
The CUDA programming model provides a simple interface for programming GPUs, but tuning GPGPU applications for high performance remains challenging. Programmers must consider several architectural details, and small changes in source code, especially to the memory access pattern, can affect performance significantly. This paper presents CuMAPz, a tool to compare the memory performance of a CUDA program. CuMAPz helps programmers explore different ways of using shared and global memories and optimize their programs for memory behavior. CuMAPz models several memory effects, e.g., data reuse, global memory access coalescing, shared memory bank conflicts, channel skew, and branch divergence. By using CuMAPz to explore the memory access design space, we improved the performance of our benchmarks by 62% over the naive cases and by 32% over a previous approach [8].
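To make two of the memory effects named above concrete, the following is a minimal sketch of how shared memory bank conflicts and global memory coalescing could be estimated for one warp's access pattern. It is not CuMAPz's model. The function names, the 32-bank layout, the 32-thread warp, the 4-byte word size, and the 128-byte transaction segment are all illustrative assumptions, and real hardware differs across GPU generations.

```python
# Illustrative sketch only; all parameters below are assumptions, not CuMAPz's model.

def bank_conflict_degree(word_indices, num_banks=32):
    """Return the worst-case number of distinct shared-memory words
    that fall into the same bank for one group of threads.
    A degree of 1 means no conflict; N means an N-way conflict."""
    per_bank = {}
    # Threads reading the same word are assumed to be served by a broadcast,
    # so only distinct words are counted.
    for w in set(word_indices):
        per_bank.setdefault(w % num_banks, set()).add(w)
    return max((len(ws) for ws in per_bank.values()), default=0)


def coalesced_transactions(byte_addresses, segment_bytes=128):
    """Return how many aligned segments one group of threads touches
    in global memory; fewer segments means better coalescing."""
    return len({a // segment_bytes for a in byte_addresses})


WARP = 32        # assumed number of threads issuing the access together
WORD_BYTES = 4   # assumed element size (e.g. a 32-bit float)

# Shared memory: thread t accesses word t * stride.
unit_stride_conflict = bank_conflict_degree([t * 1 for t in range(WARP)])
stride2_conflict = bank_conflict_degree([t * 2 for t in range(WARP)])
stride32_conflict = bank_conflict_degree([t * 32 for t in range(WARP)])
padded_conflict = bank_conflict_degree([t * 33 for t in range(WARP)])

# Global memory: thread t accesses element t * stride.
unit_stride_segments = coalesced_transactions(
    [t * WORD_BYTES for t in range(WARP)])
stride32_segments = coalesced_transactions(
    [t * 32 * WORD_BYTES for t in range(WARP)])
```

Under these assumptions, a stride of 32 words sends every thread to the same bank, while padding the stride to 33 spreads the accesses across all banks again. This illustrates the kind of small source-level change in access pattern that the abstract says can affect performance significantly.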