Trellis: Portability across architectures with a high-level framework

  • Authors:
  • Lukasz G. Szafaryn; Todd Gamblin; Bronis R. De Supinski; Kevin Skadron

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2013

Abstract

The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, which in turn cause the code base to diverge and make porting difficult. Our experience with parallel applications and frameworks leads us to conclude that achieving performance portability requires a common set of high-level directives and an efficient mapping onto each architecture. To demonstrate this concept, we develop Trellis, a prototype programming framework that allows the programmer to maintain a single generic, structured code base that executes efficiently on both the CPU and the GPU. Our approach annotates such code with a single set of high-level directives, derived from both OpenMP and OpenACC, that is compatible with both architectures. Most importantly, motivated by the limitations of the OpenACC compiler in transforming such code into a GPU kernel, we introduce a thread synchronization directive and a set of transformation techniques that together produce GPU code with the desired parallelization and better performance. While a common high-level programming framework for both the CPU and the GPU is not yet available, our analysis shows that even obtaining best-case GPU performance with OpenACC, the state-of-the-art solution, requires modifying code structure to properly exploit braided parallelism and to cope with conditional statements and serial sections. Even with this prior knowledge of compiler behavior, optimal performance remains unattainable due to the lack of a thread synchronization construct. We describe the contributions of Trellis in addressing these problems by showing how it achieves correct parallelization of the original codes of three parallel applications, with performance competitive with that of OpenMP and CUDA, improved programmability, and reduced overall code length.
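
For context, a minimal sketch of the single-source, directive-annotated style the abstract describes. The pragmas below use standard OpenMP and OpenACC syntax, not Trellis's own directive set, which the abstract does not specify; _OPENACC is the macro that OpenACC compilers predefine.

    /* One generic, annotated loop serving both targets: an OpenMP build
     * parallelizes it across CPU threads, while an OpenACC build maps it
     * to a GPU kernel. Illustrative only; Trellis's actual directives
     * are not given in the abstract. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
    #ifdef _OPENACC
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    #else
        #pragma omp parallel for
    #endif
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }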
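
The synchronization directive the abstract introduces targets patterns like the following two-phase, per-gang computation, where phase 2 reads values staged by other workers in phase 1. Hand-written CUDA would place __syncthreads() between the phases; standard OpenACC offers no equivalent directive. The "#pragma trellis barrier" line is a hypothetical placeholder, as the abstract does not name the directive.

    #define TILE 256

    /* Sketch: each gang stages a tile of the input, then every worker
     * reads its neighbors' staged values, so an intra-gang barrier is
     * required between the two phases. */
    void smooth(int n, const float *restrict in, float *restrict out)
    {
        #pragma acc parallel loop gang
        for (int b = 0; b < n / TILE; b++) {
            float tile[TILE];   /* private scratch storage for this gang */

            /* Phase 1: stage one tile of the input. */
            #pragma acc loop worker
            for (int t = 0; t < TILE; t++)
                tile[t] = in[b * TILE + t];

            /* All workers must finish staging before phase 2 reads
             * neighboring elements: */
            /* #pragma trellis barrier  -- hypothetical intra-gang barrier */

            /* Phase 2: combine neighboring staged values. */
            #pragma acc loop worker
            for (int t = 1; t < TILE - 1; t++)
                out[b * TILE + t] = 0.5f * (tile[t - 1] + tile[t + 1]);
        }
    }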