Implicit and explicit optimizations for stencil computations

Authors:
Shoaib Kamil; Kaushik Datta; Samuel Williams; Leonid Oliker; John Shalf; Katherine Yelick
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA; University of California, Berkeley, CA; University of California, Berkeley, CA; Lawrence Berkeley National Laboratory, Berkeley, CA; Lawrence Berkeley National Laboratory, Berkeley, CA; Lawrence Berkeley National Laboratory, Berkeley, CA and University of California, Berkeley, CA
Venue:
Proceedings of the 2006 workshop on Memory system performance and correctness
Year:
2006

Abstract

Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5 and the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, and include both an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache-oblivious approach and that the explicitly managed memory on Cell is more efficient: relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.
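
To make "cache reuse across stencil sweeps" concrete, the following is a minimal sketch, not the algorithm evaluated in the paper (this record does not describe it in detail). It assumes a 1D 3-point Jacobi stencil with fixed end points and uses overlapped (ghost-zone) time blocking: each block is copied together with a halo of `steps` cells, advanced through all sweeps while it would still be cache-resident, and only its interior is written back. The names `sweep`, `naive`, `time_blocked`, `steps`, and `block` are hypothetical and chosen for illustration only.

```python
def sweep(src):
    """One Jacobi sweep of a 1D 3-point stencil; the two end points stay fixed."""
    dst = list(src)
    for i in range(1, len(src) - 1):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return dst


def naive(grid, steps):
    """Reference version: every sweep streams over the whole grid."""
    for _ in range(steps):
        grid = sweep(grid)
    return grid


def time_blocked(grid, steps, block):
    """Overlapped time blocking (illustrative assumption, not the paper's code).

    Each block [lo, hi) is loaded with a halo of `steps` cells per side and
    advanced `steps` sweeps locally. Stale values creep inward from the halo
    edge by one cell per sweep, so the block interior is still exact.
    """
    n = len(grid)
    out = list(grid)
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        a = max(0, lo - steps)       # left edge of block plus halo
        b = min(n, hi + steps)       # right edge of block plus halo
        local = grid[a:b]            # small working set, reused across sweeps
        for _ in range(steps):
            local = sweep(local)
        out[lo:hi] = local[lo - a:hi - a]
    return out


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(50)]
    same = all(abs(x - y) < 1e-12
               for x, y in zip(naive(data, 4), time_blocked(data, 4, 8)))
    print("blocked result matches naive:", same)
```

The sketch trades redundant halo computation for locality. Time skewing and cache-oblivious recursive decompositions avoid that redundancy at the cost of more intricate loop bounds, so the block size and halo width here should be read as tuning assumptions rather than recommended values.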