Precise Data Locality Optimization of Nested Loops

Authors:
Vincent Loechner;Benoît Meister;Philippe Clauss
Affiliations:
ICPS/LSIIT, Université Louis Pasteur, Strasbourg, Pôle API, Bd Sébastien Brant, F-67400 Illkirch France loechner@icps.u-strasbg.fr;ICPS/LSIIT, Université Louis Pasteur, Strasbourg, Pôle API, Bd Sébastien Brant, F-67400 Illkirch France meister@icps.u-strasbg.fr;ICPS/LSIIT, Université Louis Pasteur, Strasbourg, Pôle API, Bd Sébastien Brant, F-67400 Illkirch France clauss@icps.u-strasbg.fr
Venue:
The Journal of Supercomputing
Year:
2002



Abstract

Improving the use of the memory hierarchy is a significant way to enhance application performance and to reduce power consumption in embedded processor applications. This paper proposes a framework for optimizing the temporal and spatial locality of nested loops, driven by parameterized cost functions. The loops considered may be imperfectly nested. New data layouts are propagated through connected references and across loop nests as constraints for optimizing the next connected reference, whether in the same nest or in another one. Unlike many existing methods, the framework pays special attention to TLB (Translation Lookaside Buffer) effectiveness, since a TLB miss can cost tens to hundreds of processor cycles. Our approach considers only active data, that is, array elements actually accessed by a loop, in order to avoid useless memory loads and to take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for the same array. This can significantly improve performance, since references that are a priori incompatible can be optimized simultaneously. Finally, the process considers all loop levels, not only the innermost one. Hence, in several cases it avoids large strides when control returns to the enclosing loop, and it optimizes better when the innermost loop has a small index range.
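
As a rough illustration of why a data layout change can matter for spatial locality and TLB effectiveness, the sketch below models a single inner loop `for i in range(n): use(A[i][j])` and compares the access stride and the number of distinct pages touched under two layouts of the same array. It is not the framework described in the abstract: the function names, the 64 x 64 array size, and the page capacity of 512 elements (4 KiB pages with 8-byte elements) are all assumptions chosen for the example.

```python
# Illustrative model only; names and sizes are hypothetical, not from the paper.
PAGE_ELEMS = 512  # assumption: a 4 KiB page holds 512 eight-byte elements


def row_major(i, j, n):
    """Linear offset of A[i][j] when rows are stored contiguously."""
    return i * n + j


def column_major(i, j, n):
    """Linear offset of A[i][j] when columns are stored contiguously."""
    return j * n + i


def inner_loop_trace(layout, n, j):
    """Offsets touched by `for i in range(n): use(A[i][j])` for a fixed j."""
    return [layout(i, j, n) for i in range(n)]


def stride(trace):
    """Constant distance between consecutive accesses, or None if irregular."""
    steps = {b - a for a, b in zip(trace, trace[1:])}
    return steps.pop() if len(steps) == 1 else None


def pages_touched(trace, page_elems=PAGE_ELEMS):
    """Number of distinct pages (a crude proxy for TLB entries needed)."""
    return len({offset // page_elems for offset in trace})


n = 64
before = inner_loop_trace(row_major, n, j=3)    # original layout
after = inner_loop_trace(column_major, n, j=3)  # transformed layout

print("row-major:    stride", stride(before), "pages", pages_touched(before))
print("column-major: stride", stride(after), "pages", pages_touched(after))
```

Under these assumptions, the column-wise traversal has a stride of 64 elements and touches 8 pages in the row-major layout, against a stride of 1 and a single page once the layout matches the traversal order. The model deliberately ignores cache line sizes, associativity, and the propagation of layouts between references that the abstract describes.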