Modern architectural trends in instruction-level parallelism (ILP) significantly increase the computational power of microprocessors. As a result, the demands on memory have increased, but memory systems have not kept pace. Even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Compilers must therefore be concerned not only with finding ILP to utilize machine resources effectively, but also with ensuring that the resulting code has a high degree of cache locality. One transformation that is essential for meeting both objectives is unroll-and-jam, or outer-loop unrolling. Previous work has either used a dependence-based model to compute unroll amounts, which significantly increases the size of the dependence graph, or applied a brute-force technique. In this paper, we present an algorithm that uses a linear-algebra-based technique to compute unroll amounts. Compared with dependence-based techniques, this technique reduces the total number of dependences needed in our benchmark suite by 84%. It also loses no optimization performance relative to previous techniques and yields a more elegant solution.
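To make the transformation concrete, here is a minimal sketch of unroll-and-jam applied by hand to a matrix-vector product. It illustrates only the transformation itself, not the paper's algorithm for choosing unroll amounts. The function names, the example kernel, and the fixed unroll factor of 2 are all assumptions made for illustration.

```python
def matvec(a, x):
    """Original loop nest: y[i] += a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Same computation with the outer i-loop unrolled by 2 (an assumed,
    hand-picked factor) and the two inner-loop copies jammed into one."""
    n, m = len(a), len(x)
    y = [0] * n
    i = 0
    while i + 1 < n:
        # Scalars stand in for registers holding y[i] and y[i + 1].
        y0 = 0
        y1 = 0
        for j in range(m):
            xj = x[j]              # loaded once, reused for both rows
            y0 += a[i][j] * xj
            y1 += a[i + 1][j] * xj
        y[i] = y0
        y[i + 1] = y1
        i += 2
    if i < n:
        # Remainder iteration when n is odd.
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y
```

In the jammed inner loop, each value of `x[j]` is read once and used by two independent multiply-adds, which is the kind of combined reuse and ILP gain the abstract describes. In this sketch the unroll factor is fixed by hand; choosing it automatically is the problem the paper addresses.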