Communications of the ACM
Cactus Application: Performance Predictions in Grid Environments
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
A fast Fourier transform compiler
ACM SIGPLAN Notices - Best of PLDI 1979-1999
Automatic tiling of iterative stencil loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Implicitly parallel programming models for thousand-core microprocessors
Proceedings of the 44th annual Design Automation Conference
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
3D finite difference computation on GPUs using CUDA
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs
Proceedings of the 23rd international conference on Supercomputing
A Note on Auto-tuning GEMM for GPUs
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Auto-tuning 3-D FFT library for CUDA GPUs
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
ICCIS '10 Proceedings of the 2010 International Conference on Computational and Information Sciences
Mint: realizing CUDA performance in 3D stencil methods with annotated C
Proceedings of the international conference on Supercomputing
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Vectorized higher order finite difference kernels
PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Scaling large-data computations on multi-GPU accelerators
Proceedings of the 27th international ACM conference on International conference on supercomputing
A scalable, efficient scheme for evaluation of stencil computations over unstructured meshes
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Semi-automatic restructuring of offloadable tasks for many-core accelerators
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Efficient 3D stencil computations using CUDA
Parallel Computing
Autotuning Wavefront Applications for Multicore Multi-GPU Hybrid Architectures
Proceedings of Programming Models and Applications on Multicores and Manycores
Accelerating Single Iteration Performance of CUDA-Based 3D Reaction---Diffusion Simulations
International Journal of Parallel Programming
Hi-index | 0.00 |
This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a most concise specification of stencil behavior from the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration and generates the code with optimal parameter configurations for different GPUs. This auto-tuning approach guarantees adaptive performance for different generations of GPUs while greatly enhancing programmer productivity. Experimental results show that the delivered floating point performance is very close to previous handcrafted work and outperforms other auto-tuned stencil codes by a large margin.