Kokkos Array performance-portable manycore programming model

Authors:
H. Carter Edwards; Daniel Sunderland
Affiliations:
Sandia National Laboratories, Albuquerque, NM; Sandia National Laboratories, Albuquerque, NM
Venue:
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Year:
2012

Abstract

Large, complex scientific and engineering application codes have a significant investment in computational kernels that implement their mathematical models. Porting these computational kernels to multicore-CPU and manycore-accelerator (e.g., NVIDIA® GPU) devices is a major challenge given the diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach for implementing computational kernels that are performance-portable to multicore-CPU and manycore-accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel computational kernels, and (3) multidimensional arrays. Performance portability is achieved by decoupling computational kernels from device-specific data access performance requirements (e.g., NVIDIA coalesced memory access) through an intuitive multidimensional array API. The Kokkos Array API uses C++ template meta-programming to transparently insert device-optimal data access maps into computational kernels at compile time. With this programming model, computational kernels can be written once and, without modification, performance-portably compiled to multicore-CPU and manycore-accelerator devices.
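
The idea of inserting a device-specific data access map at compile time can be illustrated with a minimal C++ sketch. This is not the Kokkos Array API: every name below (`RowMajorMap`, `ColumnMajorMap`, `HostDevice`, `AcceleratorLikeDevice`, `Array2D`, `ScaleRows`, `parallel_for_serial`, `fill_scale_and_sum`) is hypothetical, and the "devices" are plain tags running serially on the host. The sketch only shows how a kernel written against `a(i, j)` might stay unchanged while the template parameter selects the memory layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical index maps (assumed for illustration, not Kokkos Array types).
// Each converts a logical (i, j) into an offset in flat storage.
struct RowMajorMap {  // j varies fastest: one thread walks a contiguous row
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t n0, std::size_t n1) {
    (void)n0;
    return i * n1 + j;
  }
};

struct ColumnMajorMap {  // i varies fastest: adjacent work items i are adjacent in memory
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t n0, std::size_t n1) {
    (void)n1;
    return i + j * n0;
  }
};

// Hypothetical device tags; each names the map assumed to suit it.
struct HostDevice { using array_map = RowMajorMap; };
struct AcceleratorLikeDevice { using array_map = ColumnMajorMap; };

// Two-dimensional array whose layout is fixed at compile time by Device.
template <typename T, typename Device>
class Array2D {
 public:
  Array2D(std::size_t n0, std::size_t n1) : n0_(n0), n1_(n1), data_(n0 * n1) {}

  T& operator()(std::size_t i, std::size_t j) {
    assert(i < n0_ && j < n1_);
    return data_[Device::array_map::offset(i, j, n0_, n1_)];
  }

  std::size_t extent0() const { return n0_; }
  std::size_t extent1() const { return n1_; }
  const T* raw() const { return data_.data(); }

 private:
  std::size_t n0_, n1_;
  std::vector<T> data_;
};

// Data-parallel kernel written once: work item i scales row i.
// It never mentions the layout, only the logical indices.
template <typename Device>
struct ScaleRows {
  Array2D<double, Device>& a;
  double factor;
  void operator()(std::size_t i) const {
    for (std::size_t j = 0; j < a.extent1(); ++j) a(i, j) *= factor;
  }
};

// Serial stand-in for a device's parallel dispatch (assumption: a real
// backend would distribute the work items across threads).
template <typename Functor>
void parallel_for_serial(std::size_t n, const Functor& f) {
  for (std::size_t i = 0; i < n; ++i) f(i);
}

// Fills a(i, j) = 10 * i + j, doubles every entry, and returns the total.
template <typename Device>
double fill_scale_and_sum(std::size_t n0, std::size_t n1) {
  Array2D<double, Device> a(n0, n1);
  for (std::size_t i = 0; i < n0; ++i)
    for (std::size_t j = 0; j < n1; ++j)
      a(i, j) = static_cast<double>(10 * i + j);

  parallel_for_serial(n0, ScaleRows<Device>{a, 2.0});

  double sum = 0.0;
  for (std::size_t i = 0; i < n0; ++i)
    for (std::size_t j = 0; j < n1; ++j) sum += a(i, j);
  return sum;
}
```

Calling `fill_scale_and_sum<HostDevice>(3, 4)` and `fill_scale_and_sum<AcceleratorLikeDevice>(3, 4)` runs the same kernel source over two different storage orders and should give the same result, since only the offset computation differs between the two instantiations.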