The computer industry has transitioned to multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, CUDA still places many burdens on the programmer who wants to maximize performance. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. The tasks of determining the appropriate memory to use and coding the data transfers between memories are still left to the programmer. We believe that these tasks can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations, and we show preliminary results indicating that auto-generated code can achieve performance comparable to hand-coded versions.
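To make the memory-hierarchy burden concrete, the following sketch contrasts a kernel that reads only global memory with a hand-staged version that first copies a tile into on-chip shared memory. It is an illustration written for this summary, not CUDA-lite's annotation syntax or its generated output. The kernel names, the `run_stencil` helper, the 3-point stencil, and the tile size of 256 are all assumptions, and no performance figure is implied.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 256;  // threads per block (assumed value for illustration)

// Baseline: each thread reads its three inputs directly from global memory.
__global__ void stencil_global(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left  = (i > 0)     ? in[i - 1] : 0.0f;
    float right = (i < n - 1) ? in[i + 1] : 0.0f;
    out[i] = left + in[i] + right;
}

// Hand-staged: the block copies its tile plus one halo element per side
// into shared memory, synchronizes, then computes from the on-chip copy.
// This is the kind of transfer code the programmer must otherwise write by hand.
__global__ void stencil_shared(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;  // slot 0 is reserved for the left halo

    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0 && i - 1 < n) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();  // every load must finish before any thread reads the tile

    if (i < n) out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}

// Hypothetical host helper: runs one of the two kernels and returns the result.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_stencil(const std::vector<float>& h_in, bool use_shared) {
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;  // launch must use TILE threads per block
    if (use_shared) stencil_shared<<<blocks, TILE>>>(d_in, d_out, n);
    else            stencil_global<<<blocks, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `run_stencil(data, false)` and `run_stencil(data, true)` on the same input should give identical results. The second kernel differs only in the staging logic (tile indexing, halo loads, and the barrier), which is the boilerplate the abstract argues a tool could generate from annotations.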