Reduction Operations in Parallel Loops for GPGPUs

  • Authors:
  • Rengan Xu; Xiaonan Tian; Yonghong Yan; Sunita Chandrasekaran; Barbara Chapman

  • Affiliations:
  • Dept. of Computer Science, University of Houston, Houston, TX, 77004, USA (all authors)

  • Venue:
  • Proceedings of Programming Models and Applications on Multicores and Manycores
  • Year:
  • 2014

Abstract

Manycore accelerators offer the potential to significantly improve the performance of scientific applications when compute-intensive portions of a program are offloaded to them. Directive-based programming models such as OpenACC and OpenMP are high-level programming models that let users create applications for accelerators by annotating regions of code for offloading with directives. In these programming models, most of the offloaded kernels are data-parallel loops processing one or more multi-dimensional arrays, and scalar variables are often used in the parallel loop body for reduction operations. Since a reduction operation carries a loop-carried dependence that prevents straightforward parallelization of the loop, it can have a significant impact on performance if not handled properly. In this paper, we present the design and parallelization of reduction operations in parallel loops for GPGPU accelerators. Using OpenACC as the high-level directive-based programming model, we discuss how reduction operations are parallelized when they appear at each level of the loop nest and thread hierarchy. We present how we map the loops and the parallelized reductions onto single- or multiple-level parallelism of GPGPU architectures. These algorithms have been implemented in the open source OpenACC compiler OpenUH. We compare our implementation with two commercial OpenACC compilers using test cases and applications, and demonstrate better robustness and competitive performance.
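
To make the reduction pattern discussed in the abstract concrete, the sketch below shows a scalar sum reduction inside an OpenACC parallel loop in C. This is a minimal illustrative example, not code from the paper or its test suite; the reduction clause directs the compiler to privatize the scalar per thread and combine the partial results, which is exactly the loop-carried dependence the compiler must resolve when it parallelizes the loop.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 1 << 20;
        float *a = (float *)malloc(n * sizeof(float));
        float sum = 0.0f;

        for (int i = 0; i < n; i++)
            a[i] = 1.0f;

        /* The reduction clause removes the loop-carried dependence on `sum`:
         * each thread accumulates a private partial sum, and the compiler
         * generates the combination step on the device. */
        #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
        for (int i = 0; i < n; i++)
            sum += a[i];

        printf("sum = %f (expected %d)\n", sum, n);
        free(a);
        return 0;
    }

When a reduction appears on an inner loop of a nest, for example a loop mapped to the vector level while the outer loop is mapped to gangs, the compiler must generate the partial-result combination at the corresponding level of the GPU thread hierarchy; the paper describes how OpenUH handles this mapping for each level.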