Tiling optimizations for 3D scientific computations
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Treating a User-Defined Parallel Library as a Domain-Specific Language
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Finite Differences And Partial Differential Equations
Finite Differences And Partial Differential Equations
Digital Image Processing (3rd Edition)
Digital Image Processing (3rd Edition)
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Scientific computing Kernels on the cell processor
International Journal of Parallel Programming
Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation)
Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation)
Scalable parallel programming with CUDA
ACM SIGGRAPH 2008 classes
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
3D finite difference computation on GPUs using CUDA
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
A cross-input adaptive framework for GPU program optimizations
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Implementing sparse matrix-vector multiplication on throughput-oriented processors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Implementing the PGI Accelerator model
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Source-to-source optimization of CUDA C for GPU accelerated cardiac cell modeling
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
GROPHECY: GPU performance projection from CPU code skeletons
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Chestnut: a GPU programming language for non-experts
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Auto-generation and auto-tuning of 3D stencil codes on GPU clusters
Proceedings of the Tenth International Symposium on Code Generation and Optimization
A systematic process for efficient execution on Intel's heterogeneous computation nodes
Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE
The Journal of Supercomputing
Patus for convenient high-performance stencils: evaluation in earthquake simulations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
PARTANS: An autotuning framework for stencil computation on multi-GPU systems
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
From physics model to results: an optimizing framework for cross-architecture code generation
Proceedings of the Extreme Scaling Workshop
Implementing OmpSs support for regions of data in architectures with multiple address spaces
Proceedings of the 27th international ACM conference on International conference on supercomputing
Abstractions to separate concerns in semi-regular grids
Proceedings of the 27th international ACM conference on International conference on supercomputing
Journal of Parallel and Distributed Computing
Efficient Mapping of Irregular C++ Applications to Integrated GPUs
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation
Proceedings of Workshop on General Purpose Processing Using GPUs
Accelerating Single Iteration Performance of CUDA-Based 3D Reaction---Diffusion Simulations
International Journal of Parallel Programming
Hi-index | 0.00 |
We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on the 200 series NVIDIA GPUs. Our optimizations target three dimensional kernels, which present a daunting array of optimizations.