Recent advances in multi-core and many-core processors require programmers to exploit an increasing amount of parallelism in their applications. Data-parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still demand considerable effort from programmers. A number of parallelizing source-to-source compilers have recently been developed to ease the programming of multi-core and many-core processors. This work presents and evaluates a number of such tools, focusing in particular on C-to-CUDA transformations targeting GPUs. We compare these tools with each other both qualitatively and quantitatively, and identify their strengths and weaknesses. We then address the weaknesses by presenting a new classification of algorithms. This classification is used in a new source-to-source compiler, which is based on the algorithmic skeletons technique. The compiler generates target code based on skeletons of parallel structures, which can be seen as parameterisable library implementations for a set of algorithm classes. We furthermore demonstrate that the presented compiler requires few modifications to the original sequential source code, generates readable code for further fine-tuning, and delivers superior performance compared to other tools for a set of 8 image processing kernels.
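To make the idea of a skeleton as a "parameterisable library implementation" more concrete, the following is a minimal sketch of how a code generator might fill in a skeleton for one algorithm class. It is not the compiler described above: the skeleton text, the `instantiate` helper, and the `threshold` example kernel are all hypothetical names chosen for illustration, and the algorithm class shown (each output element depends on exactly one input element) is an assumption rather than one taken from the paper's classification.

```python
from string import Template

# Hypothetical skeleton for an element-wise algorithm class: every output
# element is computed from the input element at the same index.
# The placeholders (name, type, body) are the skeleton's parameters.
ELEMENTWISE_SKELETON = Template("""\
__global__ void ${name}(const ${type}* in, ${type}* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        ${type} x = in[i];
        out[i] = ${body};
    }
}
""")


def instantiate(skeleton, name, elem_type, body):
    """Fill in the skeleton's parameters and return the target code as text."""
    return skeleton.substitute(name=name, type=elem_type, body=body)


# Illustrative use: a simple image thresholding kernel (hypothetical example).
kernel_source = instantiate(
    ELEMENTWISE_SKELETON,
    name="threshold",
    elem_type="unsigned char",
    body="(x > 128) ? 255 : 0",
)
print(kernel_source)
```

In this sketch the parallel structure (thread indexing and the bounds check) lives in the skeleton, while only the per-element computation comes from the sequential source. Because the output is ordinary CUDA text, it remains readable and open to manual fine-tuning, which matches the goal stated above.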