We present SkePU, a C++ template library that provides a simple, unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU backend and a parallel OpenMP backend. It also supports multi-GPU systems. Copying data between the host and the GPU device memory can be a performance bottleneck. A key technique in SkePU is lazy memory copying, implemented in the container type used to represent skeleton operands, which avoids unnecessary memory transfers. We evaluate SkePU with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that a skeleton approach to GPU programming is viable, especially when the computational load is large compared to memory I/O (lazy memory copying can help to achieve this). They also show that using several GPUs has the potential for performance gains. SkePU performs well on a more complex and realistic task such as ODE solving, running up to 10 times faster with a GPU backend than a sequential solver on a fast CPU.
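The lazy memory copying idea can be illustrated with a minimal sketch. This is not SkePU's actual container or API: `LazyVector`, `map_on_device`, and the transfer counters are hypothetical names introduced here, and the "device" buffer is simulated with a second `std::vector` so the example runs without a GPU. The container tracks which copy (host or device) is currently valid and transfers data only when the stale side is actually accessed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container illustrating lazy host/device copying.
// The device memory is simulated by a second std::vector (assumption).
class LazyVector {
public:
    explicit LazyVector(std::size_t n, float value = 0.0f)
        : host_(n, value), device_(n, 0.0f) {}

    std::size_t size() const { return host_.size(); }

    // Host-side read: downloads only if the device holds newer data.
    float get(std::size_t i) {
        assert(i < host_.size());
        sync_to_host();
        return host_[i];
    }

    // Host-side write: the device copy becomes stale.
    void set(std::size_t i, float v) {
        assert(i < host_.size());
        sync_to_host();
        host_[i] = v;
        device_valid_ = false;
    }

    // Device-side read access: uploads only if the device copy is stale.
    const std::vector<float>& device_read() {
        sync_to_device();
        return device_;
    }

    // Device-side write access. Assumes the caller overwrites every
    // element, so no upload is needed; the host copy becomes stale.
    std::vector<float>& device_write() {
        device_valid_ = true;
        host_valid_ = false;
        return device_;
    }

    int uploads() const { return uploads_; }
    int downloads() const { return downloads_; }

private:
    void sync_to_device() {
        if (!device_valid_) {
            device_ = host_;  // stands in for a host-to-device copy
            device_valid_ = true;
            ++uploads_;
        }
    }

    void sync_to_host() {
        if (!host_valid_) {
            host_ = device_;  // stands in for a device-to-host copy
            host_valid_ = true;
            ++downloads_;
        }
    }

    std::vector<float> host_;
    std::vector<float> device_;
    bool host_valid_ = true;
    bool device_valid_ = false;
    int uploads_ = 0;
    int downloads_ = 0;
};

// Hypothetical element-wise "map" skeleton running on the simulated device.
template <typename F>
void map_on_device(F f, LazyVector& in, LazyVector& out) {
    assert(in.size() == out.size());
    const std::vector<float>& src = in.device_read();
    std::vector<float>& dst = out.device_write();
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = f(src[i]);
    }
}
```

When two such map calls are chained, so that the output of the first is the input of the second, the intermediate vector never travels back to the host: it is already valid on the device when the second call asks for it. A download happens only when host code finally reads a result through `get`, and a new upload happens only after host code has modified an operand through `set`.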