uBench: exposing the impact of CUDA block geometry in terms of performance

Authors:
Yuri Torres;Arturo Gonzalez-Escribano;Diego R. Llanos
Affiliations:
Universidad de Valladolid, Valladolid, Spain;Universidad de Valladolid, Valladolid, Spain;Universidad de Valladolid, Valladolid, Spain
Venue:
The Journal of Supercomputing
Year:
2013

Abstract

The choice of thread-block size and shape is one of the most important decisions a user makes when writing a parallel program for any CUDA architecture, because thread-block geometry has a significant impact on the overall performance of the program. Unfortunately, programmers lack sufficient information about the subtle interactions between this choice of parameters and the underlying hardware.

This paper presents uBench, a complete suite of micro-benchmarks designed to explore the performance impact of (1) the thread-block geometry selection criteria, and (2) the GPU hardware resources and configurations. Each micro-benchmark is kept as simple as possible so that it isolates a single effect arising from the hardware and the thread-block parameter choice.

As an example of the capabilities of this benchmark suite, the paper presents an experimental evaluation and comparison of the Fermi and Kepler architectures. The study reveals that, despite the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.
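To make the notion of thread-block geometry concrete, the following Python sketch enumerates block shapes with a fixed thread count and estimates two quantities that depend on the choice: warp-level occupancy of a multiprocessor, and the number of block rows covered by the first warp. It is not part of uBench and does not reproduce the paper's measurements. The per-multiprocessor limits are assumptions taken from NVIDIA's published figures for compute capability 2.x (Fermi) and 3.x (Kepler), and the names `SmLimits`, `occupancy`, `geometries`, and `rows_spanned_by_first_warp` are hypothetical. Register and shared-memory limits are deliberately ignored, so the estimate is only an upper bound.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass(frozen=True)
class SmLimits:
    """Per-multiprocessor scheduling limits (assumed values, not from the paper)."""
    max_threads_per_block: int
    max_warps_per_sm: int
    max_blocks_per_sm: int


# Assumption: figures for compute capability 2.x and 3.x.
FERMI = SmLimits(max_threads_per_block=1024, max_warps_per_sm=48, max_blocks_per_sm=8)
KEPLER = SmLimits(max_threads_per_block=1024, max_warps_per_sm=64, max_blocks_per_sm=16)


def warps_per_block(rows: int, cols: int) -> int:
    """Warps needed for a rows x cols block (rounded up to whole warps)."""
    return -(-(rows * cols) // WARP_SIZE)


def occupancy(rows: int, cols: int, limits: SmLimits) -> float:
    """Upper bound on resident warps / maximum warps, ignoring registers
    and shared memory. Returns 0.0 for blocks that cannot be launched."""
    threads = rows * cols
    if threads == 0 or threads > limits.max_threads_per_block:
        return 0.0
    warps = warps_per_block(rows, cols)
    blocks = min(limits.max_blocks_per_sm, limits.max_warps_per_sm // warps)
    return blocks * warps / limits.max_warps_per_sm


def geometries(threads: int) -> list[tuple[int, int]]:
    """All (rows, cols) shapes whose product equals `threads`."""
    return [(r, threads // r) for r in range(1, threads + 1) if threads % r == 0]


def rows_spanned_by_first_warp(rows: int, cols: int) -> int:
    """Block rows covered by the first warp. Threads are linearized with
    the x (column) index varying fastest, so narrow blocks spread one
    warp over many rows."""
    lanes = min(WARP_SIZE, rows * cols)
    return -(-lanes // cols)


if __name__ == "__main__":
    for rows, cols in geometries(256):
        print(f"{rows:>3} x {cols:<3} "
              f"fermi={occupancy(rows, cols, FERMI):.2f} "
              f"kepler={occupancy(rows, cols, KEPLER):.2f} "
              f"rows/warp={rows_spanned_by_first_warp(rows, cols)}")
```

Under these assumptions, every 256-thread shape yields the same occupancy estimate, while the rows covered by a warp range from 1 (8 x 32) to 32 (256 x 1). Size and shape therefore affect different hardware mechanisms, and a simple model like this cannot predict their combined effect.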