Stencil computations occur in many computationally intensive scientific and engineering applications, and their performance optimization has been widely studied. Compiler frameworks have also been developed that transform sequential stencil codes to improve data locality and parallelism. However, tiling stencil codes along the time dimension typically requires loop skewing, which leads to load imbalance when the tiles are executed in a pipelined parallel fashion. In this paper, we develop an approach for automatic parallelization of stencil codes that explicitly addresses load-balanced execution of tiles. Experimental results demonstrate the effectiveness of the approach.
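To make the load-imbalance problem concrete, the following minimal sketch applies classical skewed time tiling to a 1D 3-point Jacobi stencil and counts how many tiles can run concurrently on each pipeline wavefront. It illustrates the problem the abstract describes, not the approach developed in the paper. The function names, the full space-time storage, and the tile sizes are assumptions chosen for readability.

```python
from collections import Counter


def jacobi_reference(a0, steps):
    """Plain time-stepped 3-point Jacobi; the two boundary cells stay fixed."""
    prev = list(a0)
    for _ in range(steps):
        cur = list(prev)
        for i in range(1, len(a0) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
        prev = cur
    return prev


def jacobi_skewed_tiled(a0, steps, tile_t, tile_j):
    """Same stencil, tiled in the skewed space (t, j) with j = i + t.

    Returns the final row and a Counter mapping each wavefront index
    to the number of non-empty tiles on that wavefront.
    Assumption: one array row per time step is stored, which keeps the
    sketch simple but is not how a tuned code would manage memory.
    """
    n = len(a0)
    grid = [list(a0) for _ in range(steps + 1)]
    wavefront_sizes = Counter()
    # Iteration (t, i) writes grid[t + 1][i]; after skewing, j spans 1..j_max.
    j_max = (n - 2) + (steps - 1)
    for tt in range(0, steps, tile_t):
        for jj in range(0, j_max + 1, tile_j):
            work = 0
            for t in range(tt, min(tt + tile_t, steps)):
                for j in range(jj, min(jj + tile_j, j_max + 1)):
                    i = j - t  # undo the skew to recover the space index
                    if 1 <= i <= n - 2:
                        grid[t + 1][i] = (
                            grid[t][i - 1] + grid[t][i] + grid[t][i + 1]
                        ) / 3.0
                        work += 1
            if work:
                # Tiles with equal (time tile index + space tile index)
                # have no dependences between them and can run in parallel.
                wavefront_sizes[tt // tile_t + jj // tile_j] += 1
    return grid[-1], wavefront_sizes


if __name__ == "__main__":
    initial = [float(k % 5) for k in range(18)]
    result, sizes = jacobi_skewed_tiled(initial, steps=8, tile_t=4, tile_j=4)
    assert result == jacobi_reference(initial, 8)
    for wavefront in sorted(sizes):
        print(wavefront, sizes[wavefront])
```

The per-wavefront tile counts ramp up at the start of the pipeline and ramp down at the end, so some processors sit idle during those phases. This is the imbalance that skewed tiling introduces in pipelined parallel execution.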