Combining Optimization for Cache and Instruction-Level Parallelism

Authors:
Steve Carr
Venue:
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Year:
1996

Abstract

Current architectural trends in instruction-level parallelism (ILP) have significantly increased the computational power of microprocessors. As a result, the demands on the memory system have increased dramatically. Compilers must therefore not only find enough ILP to use machine resources effectively, but also ensure that the resulting code has a high degree of cache locality. Previous work has concentrated either on improving ILP in nested loops or on improving cache performance. This paper presents a performance metric that can be used to guide the optimization of nested loops, considering the combined effects of ILP, data reuse, and latency-hiding techniques. Preliminary experiments reveal that dramatic performance improvements for nested loops are obtainable: kernels run on two different architectures regularly improve by at least a factor of 2.