From physics model to results: An optimizing framework for cross-architecture code generation

Authors:
Marek Blazewicz;Ian Hinder;David M. Koppelman;Steven R. Brandt;Milosz Ciznicki;Michal Kierzynka;Frank Löffler;Erik Schnetter;Jian Tao
Affiliations:
Applications Department, Poznań Supercomputing & Networking Center, Poznań, Poland and Poznań University of Technology, Poznań, Poland;Max-Planck-Institut für Gravitationsphysik, Albert-Einstein-Institut, Potsdam, Germany;Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, USA and Division of Electrical & Computer Engineering, Louisiana State University, Baton Rouge, LA, USA;Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, USA and Division of Computer Science, Louisiana State University, Baton Rouge, LA, USA;Applications Department, Poznań Supercomputing & Networking Center, Poznań, Poland;Applications Department, Poznań Supercomputing & Networking Center, Poznań, Poland and Poznań University of Technology, Poznań, Poland;Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, USA;Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, USA and Perimeter Institute for Theoretical Physics, Waterloo, ON, Canada and Department of Physics, University of ...;Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, USA
Venue:
Scientific Programming
Year:
2013


Abstract

Starting from a high-level problem description in terms of partial differential equations written in abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high-performance code for a wide range of compute architectures. Chemora extends the capabilities of Cactus, making it possible to run complex applications efficiently on large-scale CPU/GPU systems without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
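
To make "higher-order finite differences" concrete, here is a minimal Python sketch of a standard fourth-order centered stencil for a first derivative on a one-dimensional periodic grid. It is an illustration only, not code generated by Chemora or taken from the paper: the function names `d1_fourth_order` and `max_error`, the periodic wrap-around (standing in for ghost-zone exchange between blocks or processes), and the test function sin(x) are all assumptions chosen for brevity.

```python
import math


def d1_fourth_order(u, h):
    """Approximate du/dx with a fourth-order centered stencil.

    Hypothetical helper for illustration. `u` holds samples on a uniform
    periodic grid with spacing `h`; index wrap-around replaces the
    ghost-zone exchange a real multi-block code would perform.
    """
    n = len(u)
    du = [0.0] * n
    for i in range(n):
        um2 = u[(i - 2) % n]
        um1 = u[(i - 1) % n]
        up1 = u[(i + 1) % n]
        up2 = u[(i + 2) % n]
        # Standard coefficients (1, -8, 0, 8, -1) / (12 h).
        du[i] = (um2 - 8.0 * um1 + 8.0 * up1 - up2) / (12.0 * h)
    return du


def max_error(n):
    """Largest pointwise error when differentiating sin(x) on n points."""
    h = 2.0 * math.pi / n
    x = [i * h for i in range(n)]
    u = [math.sin(xi) for xi in x]
    du = d1_fourth_order(u, h)
    return max(abs(d - math.cos(xi)) for d, xi in zip(du, x))


# Halving the grid spacing should shrink the error by about 2**4.
err_coarse = max_error(32)
err_fine = max_error(64)
print(err_coarse, err_fine, err_coarse / err_fine)
```

Halving the grid spacing reduces the error by roughly a factor of 16, which is the signature of a fourth-order scheme. A production kernel would apply such stencils in three dimensions to many variables at once, which is where loop traversal and cache strategy become important.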