Optimization strategies in different CUDA architectures using llCoMP

Authors:
Ruymán Reyes;Francisco de Sande
Affiliations:
Dept. de E. I. O. y Computación, Universidad de La Laguna, 38271 La Laguna, Spain;Dept. de E. I. O. y Computación, Universidad de La Laguna, 38271 La Laguna, Spain
Venue:
Microprocessors & Microsystems
Year:
2012

Citing 21
Cited 3

An asynchronous approach to efficient execution of programs on adaptive architectures utilizing FPGAs

Journal of Network and Computer Applications
Measuring High Performance Computing Productivity

International Journal of High Performance Computing Applications
High Performance Computing Productivity Model Synthesis

International Journal of High Performance Computing Applications
GPGPU: general purpose computation on graphics hardware

ACM SIGGRAPH 2004 Course Notes
Application of a development time productivity metric to parallel software development

Proceedings of the second international workshop on Software engineering for high performance computing system applications
Basic skeletons in llc

Parallel Computing - Algorithmic skeletons
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Scalable Parallel Programming with CUDA

Queue - GPU Computing
A closer look at GPUs

Communications of the ACM
Dynamic Load Balancing on Dedicated Heterogeneous Systems

Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture
A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
Automatic Hybrid MPI+OpenMP Code Generation with llc

Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Cetus: A Source-to-Source Compiler Infrastructure for Multicores

Computer
State-of-the-art in heterogeneous computing

Scientific Programming
JCudaMP: OpenMP/Java on CUDA

Proceedings of the 3rd International Workshop on Multicore Software Engineering
Implementing OpenMP for clusters on top of MPI

PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Effective source-to-source outlining to support whole program empirical optimization

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Unrolling loops containing task parallelism

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A ROSE-Based OpenMP 3.0 research compiler supporting multiple runtime libraries

IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more

Parallel simulation of urban dynamics on the GPU

ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part II
accULL: an OpenACC implementation with CUDA and OpenCL support

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
A preliminary evaluation of OpenACC implementations

The Journal of Supercomputing

Quantified Score

Hi-index	0.00

Abstract

Due to the current proliferation of GPU devices in HPC environments, scientists and engineers spend much of their time optimizing codes for these platforms. At the same time, manufacturers produce new versions of their devices every few years, each one more powerful than the last. The question that arises is: is the optimization effort worthwhile? In this paper, we present a review of the different CUDA architectures, including Fermi, and optimize a set of algorithms for each using widely known optimization techniques. This work would require a tremendous coding effort if done manually. However, using our fast prototyping tool, this is an effortless process. The result of our analysis will guide developers on the right path towards efficient code optimization. Preliminary results show that some optimizations recommended for older CUDA architectures may not be useful for the newer ones.