This paper introduces hierarchical overlapped tiling, a transformation that applies loop tiling and fusion to conventional loops. Overlapped tiling reduces communication overhead, but it can also generate a significant amount of redundant computation. Hierarchical overlapped tiling performs overlapped tiling hierarchically to balance communication overhead against redundant computation, and thus has the potential to provide better performance. We describe the hierarchical overlapped tiling optimization and its implementation in an OpenCL compiler, and we evaluate its effectiveness using eight programs that implement different forms of stencil computation. Our results show that hierarchical overlapped tiling achieves an average 37% speedup over traditional tiling on a 32-core workstation.
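The trade-off between communication and redundant computation can be seen in a small sketch. The code below is not the paper's OpenCL implementation; it is a minimal illustration of single-level overlapped tiling, assuming a 1D three-point averaging stencil with fixed boundary cells. The names `step`, `reference`, and `overlapped_tiled` are hypothetical. Each tile reads a halo as wide as the number of fused time steps, runs all steps locally without exchanging data with its neighbors, and writes back only its own cells. The update counter shows the extra work done in the overlapping halos.

```python
def step(a):
    """One sweep of a 3-point averaging stencil; the two end cells stay fixed."""
    out = a[:]
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def reference(a, steps):
    """Untiled baseline: apply the stencil to the whole array `steps` times."""
    for _ in range(steps):
        a = step(a)
    return a


def overlapped_tiled(a, steps, tile):
    """Overlapped tiling (assumed 1D form): fuse `steps` sweeps inside each tile.

    Returns the result and the total number of cell updates performed,
    which includes the redundant updates in the overlapping halos.
    """
    n = len(a)
    result = a[:]
    updates = 0
    for start in range(0, n, tile):
        end = min(start + tile, n)
        # Halo of width `steps` on each side, clipped at the array ends.
        lo = max(0, start - steps)
        hi = min(n, end + steps)
        local = a[lo:hi]
        for _ in range(steps):
            local = step(local)
            updates += max(0, len(local) - 2)
        # Stale values never travel further than `steps` cells inward from a
        # clipped halo edge, so the tile's own cells are correct.
        result[start:end] = local[start - lo:end - lo]
    return result, updates


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(64)]
    tiled, work = overlapped_tiled(data, steps=4, tile=16)
    baseline = reference(data, 4)
    print("max difference:", max(abs(x - y) for x, y in zip(tiled, baseline)))
    print("updates, tiled vs. baseline:", work, 4 * (len(data) - 2))
```

Smaller tiles or more fused steps make the halos a larger share of each tile, which raises the redundant work. This is the cost that the abstract says the hierarchical scheme balances against communication overhead.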