Automatic restructuring of GPU kernels for exploiting inter-thread data locality

  • Authors:
  • Swapneela Unkule;Christopher Shaltz;Apan Qasem

  • Affiliations:
  • Texas State University, San Marcos, TX;Texas State University, San Marcos, TX;Texas State University, San Marcos, TX

  • Venue:
  • CC'12 Proceedings of the 21st international conference on Compiler Construction
  • Year:
  • 2012

Abstract

Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today's HPC world. For many applications, however, achieving a high fraction of peak performance on current GPUs still requires significant programmer effort. A key consideration in optimizing GPU code is determining a suitable amount of work for each thread to perform. Thread granularity not only has a direct impact on occupancy but can also influence data locality at the register and shared-memory levels. This paper describes a software framework that analyzes dependencies among parallel GPU threads and performs source-level restructuring to obtain GPU kernels with varying thread granularity. The framework supports specification of coarsening factors through source-code annotation and also implements a heuristic, based on estimated register pressure, that automatically recommends coarsening factors for improved memory performance. We present preliminary experimental results on a select set of CUDA kernels. The results show that the proposed strategy is generally able to select profitable coarsening factors. More importantly, the results demonstrate a clear need for automatic control of thread granularity at the software level to achieve higher performance.
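To make the idea of thread granularity concrete, the following is a minimal sketch of thread coarsening, emulated on the CPU in Python rather than written in CUDA. It is not the paper's framework or heuristic. The 3-point stencil, the `launch` helper, the load counter, and the name `cf` (coarsening factor) are all assumptions introduced for illustration. The fine-grained kernel computes one output per thread. The coarsened kernel computes `cf` consecutive outputs per thread and carries neighbouring values in local variables, which stand in for registers.

```python
def launch(kernel, num_threads, *args):
    """Emulate a 1-D GPU launch: call the kernel once per thread id."""
    for tid in range(num_threads):
        kernel(tid, *args)


def stencil_fine(tid, src, dst, loads):
    """Baseline (hypothetical) kernel: one thread, one output element."""
    i = tid + 1                      # skip the left boundary element
    left, mid, right = src[i - 1], src[i], src[i + 1]
    loads[0] += 3                    # three memory loads per output
    dst[i] = left + mid + right


def make_coarse(cf):
    """Build a kernel in which each thread produces cf adjacent outputs."""
    def stencil_coarse(tid, src, dst, loads):
        n = len(src)
        base = tid * cf + 1
        if base >= n - 1:            # thread has no interior element
            return
        left, mid = src[base - 1], src[base]
        loads[0] += 2
        for k in range(cf):
            i = base + k
            if i >= n - 1:           # guard the right boundary
                break
            right = src[i + 1]
            loads[0] += 1            # only one new load per output
            dst[i] = left + mid + right
            left, mid = mid, right   # reuse values held in "registers"
    return stencil_coarse


def run(n=18, cf=4):
    """Run both variants; return outputs and total load counts."""
    src = list(range(n))
    interior = n - 2

    fine_out, fine_loads = [0] * n, [0]
    launch(stencil_fine, interior, src, fine_out, fine_loads)

    coarse_out, coarse_loads = [0] * n, [0]
    threads = (interior + cf - 1) // cf   # fewer threads are launched
    launch(make_coarse(cf), threads, src, coarse_out, coarse_loads)

    return fine_out, coarse_out, fine_loads[0], coarse_loads[0]


if __name__ == "__main__":
    fine, coarse, fine_loads, coarse_loads = run()
    print(fine == coarse, fine_loads, coarse_loads)
```

In this sketch, values that neighbouring fine-grained threads would each load separately are loaded once and reused within a single coarsened thread, which is one way inter-thread locality can become register-level reuse. The trade-off is that each coarsened thread keeps more values live and fewer threads are launched, so a larger `cf` can raise register pressure and lower occupancy on a real GPU.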