This paper proposes a compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translation, we design a small set of declarative constructs that allow the user to express stencil computations in a portable and implicitly parallel manner. Our framework translates the user-written code into actual implementation code that uses CUDA for GPU acceleration and MPI for node-level parallelization, with automatic optimizations such as overlapping computation and communication. We demonstrate the feasibility of such automatic translation by implementing several structured grid applications in our framework. Experimental results on the TSUBAME2.0 GPU-based supercomputer show that the performance is comparable to that of hand-written code, with good strong and weak scalability up to 256 GPUs.
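To make the idea of a declarative, implicitly parallel stencil concrete, the following is a minimal sketch in Python. It is not the framework's actual syntax or API: the names `diffusion_kernel`, `stencil_map`, and `get` are hypothetical and chosen only for illustration. The user writes what happens at a single grid point, and a separate driver decides how to traverse the grid. Because each output point depends only on the read-only input grid, a translator could in principle replace the sequential loops with a GPU kernel launch or a distributed decomposition without changing the user code.

```python
# Illustrative sketch only: all names here are hypothetical, not the paper's API.

def diffusion_kernel(get):
    """Per-point stencil body: average of a point and its four neighbors.

    `get(dx, dy)` reads the input grid at an offset relative to the
    current point; the kernel never sees absolute indices or loops.
    """
    return (get(0, 0) + get(-1, 0) + get(1, 0) + get(0, -1) + get(0, 1)) / 5.0


def stencil_map(kernel, grid):
    """Apply `kernel` to every interior point of a 2D grid (list of rows).

    This driver is a plain sequential loop. Since every output value is
    computed only from the unmodified input grid, the iterations are
    independent and could be run in any order or in parallel.
    """
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary points are copied unchanged
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            def get(dx, dy, x=x, y=y):
                return grid[y + dy][x + dx]
            out[y][x] = kernel(get)
    return out


# Usage: one smoothing step on a 3x3 grid with a single hot center point.
grid = [[0.0, 0.0, 0.0],
        [0.0, 5.0, 0.0],
        [0.0, 0.0, 0.0]]
result = stencil_map(diffusion_kernel, grid)
```

Keeping the per-point body separate from the traversal is the design choice this sketch is meant to show: the same `diffusion_kernel` could be handed to a different driver that targets another backend.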