Stencil computations arise in many scientific computing domains and often represent time-critical portions of applications. There is significant interest in offloading these computations to high-performance devices such as GPU accelerators, but these architectures pose challenges for developers and compilers alike. Stencil computations in particular require careful attention to off-chip memory access and to the balancing of work among the compute units of a GPU device. In this paper, we present a code generation scheme for stencil computations on GPU accelerators that optimizes the code by trading an increase in computational workload for a decrease in the required global memory bandwidth. We develop compiler algorithms that automatically generate efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation. We show that the code generation scheme achieves high performance on a range of GPU architectures, including both NVIDIA and AMD devices.
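The trade of extra computation for reduced global memory traffic can be illustrated with a minimal sketch of overlapped time tiling. This is plain Python running on the CPU, not the generated GPU code described in the paper. The 3-point averaging stencil, the function names (`step`, `reference`, `overlapped_tiled`), and the `loads` counter are all assumptions introduced for illustration. Each tile reads its cells plus a halo of `steps` cells on each side once, advances `steps` time steps on that local copy, and writes back only the interior cells, whose values are unaffected by the stale halo edges.

```python
def step(u):
    """One Jacobi sweep of a 3-point average; the two end cells are held fixed.

    Assumes len(u) >= 2 (illustrative helper, not from the paper).
    """
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def reference(u, steps):
    """Untiled baseline: every time step sweeps the whole grid."""
    for _ in range(steps):
        u = step(u)
    return u


def overlapped_tiled(u, steps, tile):
    """Overlapped time tiling: each tile redundantly recomputes its halo.

    Returns the final grid and the number of cells read from the input grid,
    which stands in for global memory traffic in this sketch.
    """
    n = len(u)
    out = list(u)
    loads = 0
    for start in range(0, n, tile):
        end = min(start + tile, n)
        # Widen the tile by `steps` cells per side, clamped to the grid.
        lo = max(start - steps, 0)
        hi = min(end + steps, n)
        local = u[lo:hi]          # single read of the tile plus its halo
        loads += hi - lo
        for _ in range(steps):
            local = step(local)   # halo edges go stale one cell per step
        # Only cells at least `steps` away from an artificial edge are valid.
        out[start:end] = local[start - lo:end - lo]
    return out, loads


grid = [float((i * 7) % 11) for i in range(64)]
tiled, loads = overlapped_tiled(grid, steps=4, tile=16)
baseline = reference(grid, steps=4)
```

In this configuration the tiled version reads 88 input cells, whereas the untiled baseline reads all 64 cells on each of its 4 sweeps. The price is that every tile also updates its halo cells, work that neighbouring tiles repeat. On a GPU the local copy would typically live in on-chip memory, and the tile size and number of fused time steps would be tuned per device.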