Integer and combinatorial optimization
Integer and combinatorial optimization
Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Supercompilers for parallel and vector computers
Supercompilers for parallel and vector computers
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
New CPU benchmark suites from SPEC
COMPCON '92 Proceedings of the thirty-seventh international conference on COMPCON
Partitioning the statement per iteration space using non-singular matrices
ICS '93 Proceedings of the 7th international conference on Supercomputing
Compiling for numa parallel machines
Compiling for numa parallel machines
Unifying data and control transformations for distributed shared-memory machines
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The Omega Library interface guide
The Omega Library interface guide
Automatic data layout for high performance Fortran
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
A novel approach towards automatic data distribution
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
A quantitative analysis of loop nest locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Data distribution support on distributed shared memory multiprocessors
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A compiler algorithm for optimizing locality in loop nests
ICS '97 Proceedings of the 11th international conference on Supercomputing
Data transformations for eliminating conflict misses
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
A hyperplane based approach for optimizing spatial locality in loop nests
ICS '98 Proceedings of the 12th international conference on Supercomputing
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Advanced compiler design and implementation
Advanced compiler design and implementation
Improving locality using loop and data transformations in an integrated framework
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
An integer linear programming approach for optimizing cache locality
ICS '99 Proceedings of the 13th international conference on Supercomputing
Dynamic data distribution with control flow analysis
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Transformations for imperfectly nested loops
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Performance analysis using the MIPS R10000 performance counters
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
False Sharing and Spatial Locality in Multiprocessor Caches
IEEE Transactions on Computers
Compiling Communication-Efficient Programs for Massively Parallel Machines
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
A Graph Based Framework to Detect Optimal Memory Layouts for Improving Data Locality
IPPS '99/SPDP '99 Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing
Locality Analysis for Distributed Shared-Memory Multiprocessors
LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
On Estimating and Enhancing Cache Effectiveness
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Automatic Selection of Dynamic Data Partitioning Schemes for Distributed-Memory Multicomputers
LCPC '95 Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing
A Matrix-Based Approach to the Global Locality Optimization Problem
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Integrating Loop and Data Transformations for Global Optimisation
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Combining Optimization for Cache and Instruction-Level Parallelism
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Improving the performance of virtual memory computers.
Improving the performance of virtual memory computers.
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Combining analytical and empirical approaches in tuning matrix transposition
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor
The Journal of Supercomputing
A 0-1 integer linear programming based approach for global locality optimizations
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
APPT'05 Proceedings of the 6th international conference on Advanced Parallel Processing Technologies
Journal of Parallel and Distributed Computing
Empirical performance-model driven data layout optimization
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Hi-index | 0.00 |
The delivered performance on modern processors that employ deep memory hierarchies is closely related to the performance of the memory subsystem. Compiler optimizations aimed at improving cache locality are critical in realizing the performance potential of powerful processors. For scientific applications, several loop transformations have been shown to be useful in improving both temporal and spatial locality. Recently, there has been some work in the area of data layout optimizations, i.e., changing the memory layouts of multidimensional arrays from the language-defined default such as column-major storage in Fortran. The effect of such memory layout decisions is on the spatial locality characteristics of loop nests. While data layout transformations are not constrained by data dependences, they have no effect on temporal locality. On the other hand, loop transformations are not readily applicable to imperfect loop nests and are constrained by data dependences. More importantly, loop transformations affect the memory access patterns of all the arrays accessed in a loop nest and, as a result, the locality characteristics of some of the arrays may worsen. This paper presents a technique based on integer linear programming (ILP) that attempts to derive the best combination of loop and data layout transformations. Prior attempts to unify loop and data layout transformations for programs consisting of a sequence of loop nests have been based on heuristics not only for transformations for a single loop nest but also for the sequence in which loop nests will be considered. The ILP formulation presented here obviates the need for such heuristics and gives us a bar against which the heuristic algorithms can be compared. More importantly, our approach is able to transform memory layouts dynamically during program execution. This is particularly useful in applications whose disjoint code segments demand different layouts for a given array. In addition, we show how this formulation can be extended to address the false sharing problem in a multiprocessor environment. The key data structure we introduce is the memory layout graph (MLG) that allows us to formulate the problems as path problems. The paper discusses the relationship of this ILP approach based on the memory layout graphs to other work in the area including our previous work. Experimental results on a MIPS R10000-based system demonstrate the benefits of this approach and show that the use of the ILP formulation does not increase the compilation time significantly.