Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

Authors:
Naoya Maruyama;Tatsuo Nomura;Kento Sato;Satoshi Matsuoka
Affiliations:
Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo, Japan;Google, Inc., Roppongi, Minato-ku, Tokyo, Japan;Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo, Japan;Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo, Japan
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

This paper proposes a compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translation, we design a small set of declarative constructs that allow the user to express stencil computations in a portable and implicitly parallel manner. Our framework translates the user-written code into implementation code that uses CUDA for GPU acceleration and MPI for node-level parallelization, applying automatic optimizations such as overlapping computation with communication. We demonstrate the feasibility of such automatic translation by implementing several structured grid applications in our framework. Experimental results on the TSUBAME2.0 GPU-based supercomputer show performance comparable to that of hand-written code, with good strong and weak scalability up to 256 GPUs.
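
For readers unfamiliar with the term, the following is a minimal sketch in plain C of the kind of stencil computation the abstract refers to: each interior grid point is updated from itself and its six nearest neighbours. This is not Physis code and does not use the framework's constructs; the names idx, average7, and stencil_step, the 7-point averaging kernel, and the copy-through boundary handling are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flattened index into an nx * ny * nz grid, x fastest. */
size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny) {
    return x + nx * (y + ny * z);
}

/* Hypothetical per-point stencil body: the average of a point and its six
 * face neighbours. It only describes what happens at one interior point. */
float average7(const float *in, size_t x, size_t y, size_t z,
               size_t nx, size_t ny) {
    float sum = in[idx(x, y, z, nx, ny)]
              + in[idx(x - 1, y, z, nx, ny)] + in[idx(x + 1, y, z, nx, ny)]
              + in[idx(x, y - 1, z, nx, ny)] + in[idx(x, y + 1, z, nx, ny)]
              + in[idx(x, y, z - 1, nx, ny)] + in[idx(x, y, z + 1, nx, ny)];
    return sum / 7.0f;
}

/* Sequential driver: applies the stencil body to every interior point and
 * copies boundary points through unchanged (an assumed boundary rule). */
void stencil_step(const float *in, float *out,
                  size_t nx, size_t ny, size_t nz) {
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    for (size_t z = 0; z < nz; ++z) {
        for (size_t y = 0; y < ny; ++y) {
            for (size_t x = 0; x < nx; ++x) {
                int boundary = x == 0 || x == nx - 1 ||
                               y == 0 || y == ny - 1 ||
                               z == 0 || z == nz - 1;
                size_t i = idx(x, y, z, nx, ny);
                out[i] = boundary ? in[i] : average7(in, x, y, z, nx, ny);
            }
        }
    }
}
```

The per-point function carries no loop order and no data placement, which is the sense in which such code can be called implicitly parallel. In this sketch the loop nest is written by hand; a translator of the kind the abstract describes would presumably generate the equivalent GPU kernels and inter-node boundary exchanges instead.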