In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include a) a more detailed cost model that includes ILP and I-cache considerations, b) a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation, and c) a new algorithm for efficiently enumerating feasible unroll vectors.

Our experimental results confirm the wide applicability of our approach by showing a 2.2X speedup on matrix multiply, and an average 1.08X speedup on seven of the SPEC95fp benchmarks (with a 1.2X speedup for two benchmarks). These speedups are significant because the baseline compiler used for comparison is the IBM XL Fortran product compiler, which generates high-quality code with unrolling and software pipelining of innermost loops enabled. Larger performance improvements due to unrolling of nested loops can be expected on processors that have more registers and a higher degree of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
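To make the terms "unroll vector" and "remainder loops" concrete, the following is a minimal sketch of the textbook unroll-and-jam transformation, which the abstract uses as its point of comparison. It is not the paper's code generation algorithm. It applies a fixed unroll vector of (2, 2, 1) to the (i, j, k) loops of a matrix multiply; the function names and the choice of unroll factors are assumptions made purely for illustration.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def dot_row_col(a, b, n, i, j):
    """Scalar inner product used by the remainder loops."""
    s = 0.0
    for k in range(n):
        s += a[i][k] * b[k][j]
    return s


def matmul_unroll_jam(a, b, n):
    """Unroll i and j by 2 and jam the four copies into one k loop.

    Hypothetical illustration with unroll vector (2, 2, 1).
    """
    c = [[0.0] * n for _ in range(n)]
    n2 = n - n % 2  # largest even bound; leftover iterations need remainder loops

    for i in range(0, n2, 2):
        for j in range(0, n2, 2):
            # Four accumulators stand in for registers holding c[i..i+1][j..j+1].
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                a0, a1 = a[i][k], a[i + 1][k]  # each load is reused twice
                b0, b1 = b[k][j], b[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
        # Remainder loop for j: the last column when n is odd.
        for j in range(n2, n):
            c[i][j] = dot_row_col(a, b, n, i, j)
            c[i + 1][j] = dot_row_col(a, b, n, i + 1, j)

    # Remainder loop for i: the last row when n is odd.
    for i in range(n2, n):
        for j in range(n):
            c[i][j] = dot_row_col(a, b, n, i, j)
    return c
```

Even with only two loops unrolled, the sketch needs two separate remainder loops in addition to the main unrolled body. This growth in remainder code is the overhead that the abstract's contribution b) aims to reduce.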