The Pochoir stencil compiler

Authors:
Yuan Tang; Rezaul Alam Chowdhury; Bradley C. Kuszmaul; Chi-Keung Luk; Charles E. Leiserson
Affiliations:
Fudan University, MIT, Cambridge, MA, USA; Boston University, MIT, Cambridge, MA, USA; MIT, Cambridge, MA, USA; Intel Corp., Hudson, MA, USA; MIT, Cambridge, MA, USA
Venue:
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Year:
2011

Abstract

A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2 to 10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
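To make the opening definition concrete, the following minimal sketch shows a stencil computation in plain C++: a 3-point heat-diffusion update on a one-dimensional grid with periodic boundaries, written as the kind of straightforward loop nest the abstract compares against. It is not Pochoir's specification language and not its trapezoidal or hyperspace-cut algorithm; the names `heat_step` and `heat_run`, the coefficient `alpha`, and the choice of the heat equation are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one time step of a 3-point stencil on a periodic 1D grid.
// Each point is updated from itself and its two nearest neighbors:
//   next[i] = cur[i] + alpha * (cur[i-1] - 2*cur[i] + cur[i+1])
std::vector<double> heat_step(const std::vector<double>& cur, double alpha) {
    const std::size_t n = cur.size();
    assert(n >= 1);  // the index arithmetic below needs a non-empty grid
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Periodic boundary: indices wrap around the ends of the grid.
        const double left = cur[(i + n - 1) % n];
        const double right = cur[(i + 1) % n];
        next[i] = cur[i] + alpha * (left - 2.0 * cur[i] + right);
    }
    return next;
}

// Hypothetical helper: apply the stencil repeatedly for `steps` time steps.
// Every step sweeps the whole grid, so a grid larger than the cache is
// re-read from memory on each step; this is the behavior that
// cache-efficient stencil algorithms aim to avoid.
std::vector<double> heat_run(std::vector<double> grid, double alpha, int steps) {
    for (int t = 0; t < steps; ++t) {
        grid = heat_step(grid, alpha);
    }
    return grid;
}
```

For example, calling `heat_run({0.0, 0.0, 4.0, 0.0, 0.0}, 0.25, 1)` spreads the central value to its two neighbors after one step. The independent iterations of the inner loop are what a standard parallel-loop implementation would distribute across cores.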