Register tiling in nonrectangular iteration spaces

Authors:
Marta Jiménez;José M. Llabería;Agustín Fernández
Affiliations:
Universitat Politècnica de Catalunya, Barcelona, Spain;Universitat Politècnica de Catalunya, Barcelona, Spain;Universitat Politècnica de Catalunya, Barcelona, Spain
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
2002

Citing 45
Cited 13

Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Strategies for cache and local memory management by global program transformation

Proceedings of the 1st International Conference on Supercomputing
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Scanning polyhedra with DO loops

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Efficient and exact data dependence analysis

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
LAPACK's user's guide

LAPACK's user's guide
Optimizing for parallelism and data locality

ICS '92 Proceedings of the 6th international conference on Supercomputing
A practical data flow framework for array reference analysis and its use in optimizations

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Access normalization: loop restructuring for NUMA computers

ACM Transactions on Computer Systems (TOCS)
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Improving locality and parallelism in nested loops

Improving locality and parallelism in nested loops
Scalar replacement in the presence of conditional control flow

Software—Practice & Experience
MOB forms: a class of multilevel block algorithms for dense linear algebra operations

ICS '94 Proceedings of the 8th international conference on Supercomputing
Memory-hierarchy management

Memory-hierarchy management
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
DXML: a high-performance scientific subroutine library

Digital Technical Journal
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Compiler cache optimizations for banded matrix problems

ICS '95 Proceedings of the 9th international conference on Supercomputing
Alpha implementations and architecture: complete reference and guide

Alpha implementations and architecture: complete reference and guide
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Combining loop transformations considering caches and scheduling

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A compiler algorithm for optimizing locality in loop nests

ICS '97 Proceedings of the 11th international conference on Supercomputing
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Unroll-and-jam using uniformly generated sets

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Quantifying the multi-level nature of tiling interactions

International Journal of Parallel Programming
Optimized unrolling of nested loops

Proceedings of the 14th international conference on Supercomputing
Dependence Analysis for Supercomputing

Dependence Analysis for Supercomputing
Dependence graphs and compiler optimizations

POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Iteration Space Tiling for Memory Hierarchies

Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Performance Evaluation of Tiling for the Register Level

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Combining Optimization for Cache and Instruction-Level Parallelism

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Automatic Blocking of Nested Loops

Automatic Blocking of Nested Loops
Improving the performance of virtual memory computers.

Improving the performance of virtual memory computers.

Applications of storage mapping optimization to register promotion

Proceedings of the 18th annual international conference on Supercomputing
Architecture-aware classical Taylor shift by 1

Proceedings of the 2005 international symposium on Symbolic and algebraic computation
Parameterized tiled loops for free

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Multi-level tiling: M for the price of one

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Positivity, posynomials and tile size selection

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Parametric multi-level tiling of imperfectly nested loops

Proceedings of the 23rd international conference on Supercomputing
Compact multi-dimensional kernel extraction for register tiling

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Parameterized tiling revisited

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
A programming language interface to describe transformations and code generation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Combined ILP and register tiling: analytical model and optimization framework

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Loop transformation recipes for code generation and auto-tuning

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Efficient tiled loop generation: D-tiling

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Parameterized loop tiling

ACM Transactions on Programming Languages and Systems (TOPLAS)

Quantified Score

Hi-index	0.01

Visualization

Abstract

Loop tiling is a well-known loop transformation generally used to expose coarse-grain parallelism and to exploit data reuse at the cache level. Tiling can also be used to exploit data reuse at the register level and to improve a program's ILP. However, previous proposals in the literature (as well as commercial compilers) are only able to perform multidimensional tiling for the register level when the iteration space is rectangular. In this article we present a new general algorithm to perform multidimensional tiling for the register level in both rectangular and nonrectangular iteration spaces. We also propose a simple heuristic to determine the tiling parameters at this level. Finally, we evaluate our method using as benchmarks typical linear algebra algorithms having nonrectangular iteration spaces and compare our proposal against hand-optimized vendor-supplied numerical libraries and against commercial compilers able to perform optimizing code transformations such as inner unrolling, unroll-and-jam, and software pipelining. Measurements were taken on three different superscalar microprocessors. Results will show that our method outperforms the native compilers (showing speedups of 2.5 in average) and matches the performance of vendor-supplied numerical libraries. The general conclusion is that compiler technology can make it possible for nonrectangular loop nests to achieve as high performance as hand-optimized codes.