Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Authors:
Kaushik Datta;Mark Murphy;Vasily Volkov;Samuel Williams;Jonathan Carter;Leonid Oliker;David Patterson;John Shalf;Katherine Yelick
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA;University of California at Berkeley, Berkeley, CA;University of California at Berkeley, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA
Venue:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Year:
2008



Abstract

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications for scientific algorithm development.
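
To make the two central ideas of the abstract concrete, the following is a minimal sketch of a 7-point nearest-neighbor stencil sweep with loop blocking, plus a toy auto-tuner that times a handful of block sizes and keeps the fastest. It is not the authors' implementation: the function names, the coefficients `alpha` and `beta`, the grid size, and the candidate block sizes are all assumptions chosen for illustration, and pure Python will not show the cache effects that a tuned C or CUDA kernel would.

```python
import itertools
import time


def make_grid(n):
    """Return an n*n*n grid as a flat list with simple deterministic values."""
    return [float((i * 31 + 7) % 17) for i in range(n * n * n)]


def sweep_blocked(src, n, bx, by, alpha=0.4, beta=0.1):
    """One out-of-place 7-point sweep over interior points, tiled in x and y.

    The block sizes (bx, by) change only the traversal order, not the result.
    alpha and beta are hypothetical stencil coefficients.
    """
    dst = list(src)  # boundary values are carried over unchanged
    plane, row = n * n, n
    for jj in range(1, n - 1, by):          # loop over y blocks
        for ii in range(1, n - 1, bx):      # loop over x blocks
            for k in range(1, n - 1):       # stream through z inside a block
                for j in range(jj, min(jj + by, n - 1)):
                    base = k * plane + j * row
                    for i in range(ii, min(ii + bx, n - 1)):
                        c = base + i
                        dst[c] = alpha * src[c] + beta * (
                            src[c - 1] + src[c + 1]
                            + src[c - row] + src[c + row]
                            + src[c - plane] + src[c + plane]
                        )
    return dst


def autotune(src, n, candidates, trials=2):
    """Exhaustively time each (bx, by) candidate and return the fastest."""
    best, best_time = None, float("inf")
    for bx, by in candidates:
        elapsed = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            sweep_blocked(src, n, bx, by)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best, best_time = (bx, by), elapsed
    return best, best_time


n = 16
grid = make_grid(n)
candidates = list(itertools.product([4, 8, 16], repeat=2))
reference = sweep_blocked(grid, n, n, n)  # a single block: untiled order
best_block, best_seconds = autotune(grid, n, candidates)
print("fastest block (bx, by):", best_block)
```

Because every output point is computed from the source grid alone, every block size yields the same result, so the tuner is free to choose purely on measured runtime. An exhaustive search is used here only because the candidate set is tiny; a larger parameter space would need a smarter search strategy.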