A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
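To make the definition of a stencil concrete, the following minimal sketch updates a 1-dimensional grid with a three-point heat kernel and periodic boundaries, using the plain loop form that serves as the baseline in comparisons of this kind. It is written in ordinary C++ and is not the Pochoir specification language or its generated code; the function names `heat_step` and `heat_run` and the coefficient `alpha` are assumptions introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One time step of a 1D three-point stencil with periodic boundaries.
// Each point is updated from itself and its two nearest neighbors.
// `alpha` is a hypothetical diffusion coefficient chosen for illustration.
std::vector<double> heat_step(const std::vector<double>& u, double alpha) {
    assert(!u.empty());
    const std::size_t n = u.size();
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Periodic boundary: indices wrap around the ends of the grid.
        const double left = u[(i + n - 1) % n];
        const double right = u[(i + 1) % n];
        next[i] = u[i] + alpha * (left - 2.0 * u[i] + right);
    }
    return next;
}

// Apply the stencil for `steps` time steps. Every step sweeps the whole
// grid, so for large grids each step streams the data through the cache
// again; trapezoidal decompositions are designed to avoid that cost.
std::vector<double> heat_run(std::vector<double> u, double alpha, int steps) {
    for (int t = 0; t < steps; ++t) {
        u = heat_step(u, alpha);
    }
    return u;
}
```

The spatial loop in `heat_step` has no dependences between iterations, so it is the loop a parallel-loop implementation would distribute across cores. A cache-oblivious algorithm would instead reorder the work across both space and time, which this sketch does not attempt.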