A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

Authors:
Allen Leung;Nicolas Vasilache;Benoît Meister;Muthu Baskaran;David Wohlford;Cédric Bastoul;Richard Lethin
Affiliations:
Reservoir Labs, New York, NY;Reservoir Labs, New York, NY;Reservoir Labs, New York, NY;Reservoir Labs, New York, NY;Reservoir Labs, New York, NY;Reservoir Labs, New York, NY;Reservoir Labs, New York, NY
Venue:
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Year:
2010

Citing 10
Cited 17

Scanning polyhedra with DO loops

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Some efficient solutions to the affine scheduling problem: I. One-dimensional time

International Journal of Parallel Programming
A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed

Mathematics of Operations Research
High performance computing: challenges for future systems

High performance computing: challenges for future systems
Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations

Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
A practical automatic polyhedral parallelizer and locality optimizer

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic C-to-CUDA code generation for affine programs

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction

Reducing branch divergence in GPU programs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Automatic CPU-GPU communication management and optimization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
CuMAPz: a tool to analyze memory access patterns in CUDA

Proceedings of the 48th Design Automation Conference
Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
FLAT: a GPU programming framework to provide embedded MPI

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Hierarchical overlapped tiling

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Early evaluation of directive-based GPU programming models for productive exascale computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Polyhedral parallel code generation for CUDA

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
OpenMPC: extended OpenMP for efficient programming and tuning on GPUs

International Journal of Computational Science and Engineering
Portable performance on heterogeneous architectures

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Memory reuse optimizations in the R-Stream compiler

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA

Proceedings of the Conference on Design, Automation and Test in Europe
Memory performance estimation of CUDA programs

ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Generating efficient data movement code for heterogeneous architectures with distributed-memory

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hybrid Hexagonal/Classical Tiling for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.00

Visualization

Abstract

Programmers for GPGPU face rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular conjunctions of application kernel, programming languages, and GPU hardware instance, that it is possible to achieve significant improvements in the price/performance and energy/performance over general purpose processors. But these demonstrations are each the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability. This paper discusses the implementation, in the R-Stream compiler, of a source to source mapping pathway from a high-level, textbook-style algorithm expression method in ANSI C, to multi-GPGPU accelerated computers. The compiler performs hierarchical decomposition and parallelization of the algorithm between and across host, multiple GPGPUs, and within-GPU. The semantic transformations are expressed within the polyhedral model, including optimization of integrated parallelization, locality, and contiguity tradeoffs. Hierarchical tiling is performed. Communication and synchronizations operations at multiple levels are generated automatically. The resulting mapping is currently emitted in the CUDA programming language. The GPU backend adds to the range of hardware and accelerator targets for R-Stream and indicates the potential for performance portability of single sources across multiple hardware targets.