We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. The model identifies bottlenecks in GPU programs and analyzes performance quantitatively, allowing programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use microbenchmarks to develop a throughput model for the three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because the model is based on the GPU's native instruction set, it predicts performance within 5--15% error. To demonstrate its usefulness, we analyze three representative, already highly optimized real-world programs: dense matrix multiply, a tridiagonal systems solver, and sparse matrix-vector multiply. The model's detailed quantitative performance analysis lets us explain the configuration of the fastest dense matrix multiply implementation, and to optimize the tridiagonal solver and sparse matrix-vector multiply by 60% and 18%, respectively. Furthermore, applying the model to these codes lets us suggest architectural improvements in hardware resource allocation, avoidance of bank conflicts, block scheduling, and memory transaction granularity.
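The core idea of such a throughput model can be sketched as a bottleneck calculation: predicted kernel time is bounded by whichever of the three components (instruction pipeline, shared memory, global memory) takes longest to process the kernel's demand at its measured throughput. The sketch below is an illustrative assumption of the model's structure, not the paper's actual implementation; all parameter names and rates are hypothetical.

```python
def predict_kernel_time(inst_count, smem_accesses, gmem_bytes,
                        inst_throughput, smem_throughput, gmem_bandwidth):
    """Bottleneck-style estimate of one kernel launch (illustrative sketch).

    inst_count      -- dynamic instruction count across all threads
    smem_accesses   -- shared-memory accesses (bank-conflict adjusted)
    gmem_bytes      -- bytes moved to/from global memory
    inst_throughput -- instructions/second the pipeline can retire
    smem_throughput -- shared-memory accesses/second
    gmem_bandwidth  -- global-memory bytes/second (microbenchmarked)

    Returns (predicted_time_seconds, bottleneck_name).
    """
    # Time each component would need in isolation to service the demand.
    times = {
        "instruction pipeline": inst_count / inst_throughput,
        "shared memory": smem_accesses / smem_throughput,
        "global memory": gmem_bytes / gmem_bandwidth,
    }
    # The slowest component dominates and names the bottleneck.
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck
```

For example, a kernel that issues 10^9 instructions but streams 4 GB through a 100 GB/s global-memory system would be reported as global-memory bound, which is the kind of diagnosis that guides the optimizations described above.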