Due to the current proliferation of GPU devices in HPC environments, scientists and engineers spend much of their time optimizing codes for these platforms. At the same time, manufacturers release new versions of their devices every few years, each more powerful than the last. The question that arises is: is the optimization effort worthwhile? In this paper, we review the different CUDA architectures, including Fermi, and optimize a set of algorithms for each using widely known optimization techniques. This work would require a tremendous coding effort if done manually; with our fast prototyping tool, however, it is an effortless process. The results of our analysis will guide developers toward efficient code optimization. Preliminary results show that some optimizations recommended for older CUDA architectures may not be useful for newer ones.