Programming with POSIX threads
Programming with POSIX threads
OpenUH: an optimizing, portable OpenMP compiler: Research Articles
Concurrency and Computation: Practice & Experience - Current Trends in Compilers for Parallel Computers (CPC2006)
Scalability Evaluation of Barrier Algorithms for OpenMP
IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
CUDA Programming: A Developer's Guide to Parallel Computing with GPUs
CUDA Programming: A Developer's Guide to Parallel Computing with GPUs
Integrating Multi-GPU Execution in an OpenACC Compiler
ICPP '13 Proceedings of the 2013 42nd International Conference on Parallel Processing
Hi-index | 0.00 |
Manycore accelerators offer the potential of significantly improving the performance of scientific applications when offloading compute intensive portions of programs to the accelerators. Directive-based programming models such as OpenACC and OpenMP are high-level programming model for users to create applications for accelerators by annotating region of code for offloading with directives. In these programming models, most of the offloaded kernels are data parallel loops processing one or multiple multi-dimensional arrays, and it is often that scalar variables are used in the parallel loop body for reduction operations. Since reduction operation itself has loop-carried dependency preventing the parallelization of the loops, this could have a significant impact on the performance if not handled properly. In this paper, we present the design and parallelization of reduction operations in parallel loops for GPGPU accelerators. Using OpenACC as the high-level directive-based programing model, we discuss how reduction operations are parallelized when appearing in each level of the loop nest and thread hierarchy. We present how we handle the mapping of the loops and parallelized reduction to single- or multiple-level parallelism of GPGPU architectures. These algorithms have been implemented in the open source OpenACC compiler OpenUH. We compare our implementation with two other commercial OpenACC compilers using test cases and applications, and demonstrate better robustness and competitive performance than others.