Breaking the GPU programming barrier with the auto-parallelising SAC compiler

Authors:
Jing Guo; Jeyarajan Thiyagalingam; Sven-Bodo Scholz
Affiliations:
University of Hertfordshire, Hatfield, United Kingdom; University of Oxford, Oxford, United Kingdom; University of Hertfordshire, Hatfield, United Kingdom
Venue:
Proceedings of the sixth workshop on Declarative aspects of multicore programming
Year:
2011

Abstract

Over recent years, the use of Graphics Processing Units (GPUs) for general-purpose computing has become increasingly popular. The main reasons for this development are the attractive performance/price and performance/power ratios of these architectures. However, substantial performance gains from GPUs come at a price: they require extensive programming expertise and, typically, a substantial re-coding effort. Although frameworks such as CUDA and OpenCL have significantly improved the programming experience, it is still a challenge to utilise these devices effectively. Directive-based approaches such as hiCUDA or OpenMP variants offer further improvements but have not eliminated the need for expertise in these complex architectures. Similarly, special-purpose programming languages such as Microsoft's Accelerator try to lower the barrier further. They provide the programmer with special GPU data structures and operations on them, which are then compiled into GPU code. In this paper, we take this trend towards a completely implicit, high-level approach one step further. We generate CUDA code from a MATLAB-like, high-level functional array programming language, Single Assignment C (SaC). To do so, we identify which data structures and operations can be successfully mapped onto GPUs and transform existing programs accordingly. This paper presents the first runtime results from our GPU backend, together with the basic set of GPU-specific program optimisations that turned out to be essential. Despite our high-level program specifications, we show that for a number of benchmarks our parallelising compiler achieves speedups by a factor of between 5 and 50.
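
To make the idea of mapping array operations onto a GPU more concrete, the following is a minimal sketch, not the SaC compiler's actual code generation scheme. It emulates on the CPU, in plain Python, how an element-wise array operation can be split into a CUDA-style grid of thread blocks, with each emulated thread computing one output element. The names `launch`, `add_kernel`, `vector_add`, and `BLOCK_SIZE` are hypothetical and chosen only for illustration.

```python
# Sketch only: emulates on the CPU how an element-wise array operation
# could be mapped onto a CUDA-style grid of thread blocks.
# All names here (launch, add_kernel, vector_add, BLOCK_SIZE) are hypothetical.

BLOCK_SIZE = 4  # threads per block; an arbitrary choice for illustration


def add_kernel(block_idx, thread_idx, a, b, out):
    """Body executed by one emulated thread: computes one output element."""
    i = block_idx * BLOCK_SIZE + thread_idx  # global element index
    if i < len(out):  # guard: the last block may be only partly used
        out[i] = a[i] + b[i]


def launch(kernel, n, *args):
    """Run the kernel once for every (block, thread) pair covering n elements."""
    num_blocks = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    for block_idx in range(num_blocks):
        for thread_idx in range(BLOCK_SIZE):
            kernel(block_idx, thread_idx, *args)


def vector_add(a, b):
    """Element-wise sum of two equal-length lists via the emulated grid."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = [0] * len(a)
    launch(add_kernel, len(out), a, b, out)
    return out
```

For example, `vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])` returns `[11, 22, 33, 44, 55]`. Because each element depends only on its own index, the two loops in `launch` could run in any order or fully in parallel, which is the property that makes such operations candidates for GPU execution.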