An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

  • Authors:
  • Xin Huo;Vignesh Ravi;Wenjing Ma;Gagan Agrawal

  • Affiliations:
  • The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA

  • Venue:
  • Proceedings of the international conference on Supercomputing
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

GPUs have rapidly emerged as a very significant player in high performance computing. However, despite the popularity of CUDA, there are significant challenges in porting different classes of HPC applications on modern GPUs. This paper focuses on the challenges of implementing irregular applications arising from unstructured grids on modern NVIDIA GPUs. Considering the importance of irregular reductions in scientific and engineering codes, substantial effort was made in developing compiler and runtime support for parallelization or optimization of these codes in the previous two decades, with different efforts targeting distributed memory machines, distributed shared memory machines, shared memory machines, or cache performance improvement on uniprocessor machines. However, there have not been any systematic studies on parallelizing these applications on modern GPUs. There are at least two significant challenges associated with porting this class of applications on modern GPUs. The first is related to correct and efficient parallelization while using a large number of threads. The second challenge is effective use of shared memory. Since data accesses cannot be determined statically, runtime partitioning methods are needed for effectively using the shared memory. This paper describes an execution methodology that can address the above two challenges. We have also developed optimized runtime modules to support our execution methodology. Our approach and runtime methods have been extensively evaluated using two indirection array based applications.