Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems (TOPLAS)
Coloring heuristics for register allocation
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Introduction to algorithms
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A practical algorithm for exact array dependence analysis
Communications of the ACM
Network flows: theory, algorithms, and applications
Network flows: theory, algorithms, and applications
Improving locality and parallelism in nested loops
Improving locality and parallelism in nested loops
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
ACM Computing Surveys (CSUR)
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Exploiting monotone convergence functions in parallel programs
Exploiting monotone convergence functions in parallel programs
Fusion of Loops for Parallelism and Locality
IEEE Transactions on Parallel and Distributed Systems
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Non-singular data transformations: definition, validity and applications
ICS '97 Proceedings of the 11th international conference on Supercomputing
Experience with efficient array data flow analysis for array privatization
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Interprocedural analysis for loop scheduling and data allocation
Parallel Computing - Special issues on languages and compilers for parallel computers
Schedule-independent storage mapping for loops
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Precise miss analysis for program transformations with caches of arbitrary associativity
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Augmenting Loop Tiling with Data Alignment for Improved Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
Nonlinear and Symbolic Data Dependence Testing
IEEE Transactions on Parallel and Distributed Systems
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
A tile selection algorithm for data locality and cache interference
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Proceedings of the 14th international conference on Supercomputing
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests
Proceedings of the 14th international conference on Supercomputing
Loop tiling for parallelism
Transformations for imperfectly nested loops
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Tiling optimizations for 3D scientific computations
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Data locality enhancement by memory reduction
ICS '01 Proceedings of the 15th international conference on Supercomputing
Loop optimization for a class of memory-constrained computations
ICS '01 Proceedings of the 15th international conference on Supercomputing
Language support for Morton-order matrices
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Computers and Intractability: A Guide to the Theory of NP-Completeness
Computers and Intractability: A Guide to the Theory of NP-Completeness
Increasing temporal locality with skewing and recursive blocking
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
Achieving Scalable Locality with Time Skewing
International Journal of Parallel Programming
Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
A Comparison of Compiler Tiling Algorithms
CC '99 Proceedings of the 8th International Conference on Compiler Construction, Held as Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS'99
A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness
CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
A Matrix-Based Approach to the Global Locality Optimization Problem
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Analysis of Memory Hierarchy Performance of Block Data Layout
ICPP '02 Proceedings of the 2002 International Conference on Parallel Processing
Fine-grained analysis of array computations
Fine-grained analysis of array computations
Impact of modern memory subsystems on cache optimizations for stencil computations
Proceedings of the 2005 workshop on Memory system performance
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
Generation and optimisation of code using Coxeter lattice paths
Proceedings of the 2007 international workshop on Parallel symbolic computation
Scientific computing Kernels on the cell processor
International Journal of Parallel Programming
Analyzing memory access intensity in parallel programs on multicore
Proceedings of the 22nd annual international conference on Supercomputing
Using Padding to Optimize Locality in Scientific Applications
ICCS '08 Proceedings of the 8th international conference on Computational Science, Part I
Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs
Proceedings of the 23rd international conference on Supercomputing
Simultaneous minimization of capacity and conflict misses
Journal of Computer Science and Technology
Data layout transformation for stencil computations on short-vector SIMD architectures
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Automatic code generation and tuning for stencil kernels on modern shared memory architectures
Computer Science - Research and Development
Out-of-Core Computations of High-Resolution Level Sets by Means of Code Transformation
Journal of Scientific Computing
Optimizing stencil application on multi-thread GPU architecture using stream programming model
ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
Auto-generation and auto-tuning of 3D stencil codes on GPU clusters
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Patus for convenient high-performance stencils: evaluation in earthquake simulations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Proceedings of the 50th Annual Design Automation Conference
International Journal of High Performance Computing Applications
An application-centric evaluation of OpenCL on multi-core CPUs
Parallel Computing
Hi-index | 0.00 |
Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation and signal processing. Such loops iteratively modify the same array elements over different time steps, which presents opportunities for the compiler to improve the temporal data locality through loop tiling. This article presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving the cache performance. The article first presents a technique which allows loop tiling to satisfy data dependences in spite of the difficulty created by imperfectly nested inner loops. It does so by skewing the inner loops over the time steps and by applying a uniform skew factor to all loops at the same nesting level. Based on a memory cost analysis, the article shows that the skew factor must be minimized at every loop level in order to minimize cache misses. A graph-theoretical algorithm, which takes polynomial time, is presented to determine the minimum skew factor. Furthermore, the memory-cost analysis derives the tile size which minimizes capacity misses. Given the tile size, an efficient and general array-padding scheme is applied to remove conflict misses. Experiments were conducted on 16 test programs and preliminary results showed an average speedup of 1.58 and a maximum speedup of 5.06 across those test programs.