Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Unroll-and-jam using uniformly generated sets
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A general algorithm for tiling the register level
ICS '98 Proceedings of the 12th international conference on Supercomputing
An integer linear programming approach for optimizing cache locality
ICS '99 Proceedings of the 13th international conference on Supercomputing
Code transformations to improve memory parallelism
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Static and Dynamic Locality Optimizations Using Integer Linear Programming
IEEE Transactions on Parallel and Distributed Systems
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
Handling Global Constraints in Compiler Strategy
International Journal of Parallel Programming
Combining Loop Transformations Considering Caches and Scheduling
International Journal of Parallel Programming
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
PaCT '999 Proceedings of the 5th International Conference on Parallel Computing Technologies
Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation - SAMOS
Load Scheduling with Profile Information
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Cache Models for Iterative Compilation
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Embedded processor design challenges
On the Parallel Execution Time of Tiled Loops
IEEE Transactions on Parallel and Distributed Systems
Improving workload balance and code optimization on processor-in-memory systems
Journal of Systems and Software
Improving register allocation for subscripted variables
ACM SIGPLAN Notices - Best of PLDI 1979-1999
A Simulation and Exploration Technology for Multimedia-Application-Driven Architectures
Journal of VLSI Signal Processing Systems
Iterative compilation for energy reduction
Journal of Embedded Computing - Cache exploitation in embedded systems
Hi-index | 0.00 |
Current architectural trends in instruction-level parallelism (ILP) have significantly increased the computational power of microprocessors. As a result, the demands on the memory system have increased dramatically. Not only do compilers need to be concerned with finding ILP to utilize machine resources effectively, but they also need to be concerned with ensuring that the resulting code has a high degree of cache locality. Previous work has concentrated either on improving ILP in nested loops or on improving cache performance. This paper presents a performance metric that can be used to guide the optimization of nested loops considering the combined effects of ILP, data reuse and latency hiding techniques. Preliminary experiments reveal that dramatic performance improvements for nested loops are obtainable (we regularly get at least a factor of 2 on kernels run on two different architectures).