Reduction Operations in Parallel Loops for GPGPUs

  • Authors:
  • Rengan Xu; Xiaonan Tian; Yonghong Yan; Sunita Chandrasekaran; Barbara Chapman

  • Affiliations:
  • Dept. of Computer Science, University of Houston, Houston, TX, 77004, USA (all authors)

  • Venue:
  • Proceedings of Programming Models and Applications on Multicores and Manycores
  • Year:
  • 2014

Abstract

Manycore accelerators offer the potential to significantly improve the performance of scientific applications when compute-intensive portions of a program are offloaded to them. Directive-based programming models such as OpenACC and OpenMP are high-level programming models that let users create applications for accelerators by annotating regions of code for offloading with directives. In these programming models, most of the offloaded kernels are data-parallel loops processing one or more multi-dimensional arrays, and scalar variables are often used in the parallel loop body for reduction operations. Since a reduction operation carries a loop-carried dependence that prevents straightforward parallelization of the loop, it can have a significant impact on performance if not handled properly. In this paper, we present the design and parallelization of reduction operations in parallel loops for GPGPU accelerators. Using OpenACC as the high-level directive-based programming model, we discuss how reduction operations are parallelized when they appear at each level of the loop nest and thread hierarchy. We present how we map the loops and the parallelized reductions onto single- or multiple-level parallelism of GPGPU architectures. These algorithms have been implemented in the open source OpenACC compiler OpenUH. We compare our implementation with two commercial OpenACC compilers using test cases and applications, and demonstrate better robustness and competitive performance.
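
To make the reduction pattern discussed in the abstract concrete, the sketch below shows a scalar sum reduction inside an OpenACC parallel loop in C. This is a minimal illustrative example, not code from the paper or its test suite; the reduction clause directs the compiler to privatize the scalar per thread and combine the partial results, which is exactly the loop-carried dependence the compiler must resolve when it parallelizes the loop.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 1 << 20;
        float *a = (float *)malloc(n * sizeof(float));
        float sum = 0.0f;

        for (int i = 0; i < n; i++)
            a[i] = 1.0f;

        /* The reduction clause removes the loop-carried dependence on `sum`:
         * each thread accumulates a private partial sum, and the compiler
         * generates the combination step on the device. */
        #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
        for (int i = 0; i < n; i++)
            sum += a[i];

        printf("sum = %f (expected %d)\n", sum, n);
        free(a);
        return 0;
    }

When a reduction appears on an inner loop of a nest, for example a loop mapped to the vector level while the outer loop is mapped to gangs, the compiler must generate the partial-result combination at the corresponding level of the GPU thread hierarchy; the paper describes how OpenUH handles this mapping for each level.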