An optimizing compiler has difficulty generating code that performs at top speed for an arbitrary data set size. In general, the low-level optimization process must take parameters such as the loop trip count into account to generate efficient code. The code can be specialized for different data set size ranges, at the expense of code expansion and decision-tree overhead. We propose a new method for specializing loop structures at the assembly level that drastically cuts this overhead through a new folding approach. Our technique can generate several versions, tuned for small, medium, and large iteration counts, and combine them sequentially at the assembly level. We first show on the SPEC benchmarks the need for specialization on small loops. We then demonstrate the benefit of our method on kernels, with detailed results.
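To make the baseline idea concrete, here is a minimal C sketch of classic source-level loop versioning by trip count, the technique whose dispatch overhead the paper's assembly-level folding aims to eliminate. The threshold `SMALL_TRIP` and the `sum` kernel are hypothetical illustrations, not taken from the paper:

```c
#include <stddef.h>

/* Hypothetical threshold separating the "small" and "large" versions. */
#define SMALL_TRIP 8

static double sum_small(const double *a, size_t n) {
    /* Version tuned for few iterations: a plain loop, so short trip
     * counts pay no unrolling prologue/epilogue overhead. */
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

static double sum_large(const double *a, size_t n) {
    /* Version tuned for many iterations: a 4-way unrolled main loop
     * with independent accumulators, plus a remainder loop. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)      /* handle the leftover iterations */
        s += a[i];
    return s;
}

double sum(const double *a, size_t n) {
    /* The decision tree: one run-time test selects the specialized
     * version. This test, multiplied across many versions, is the
     * overhead the paper's folding approach avoids. */
    return (n <= SMALL_TRIP) ? sum_small(a, n) : sum_large(a, n);
}
```

Both versions compute the same result; the specialization only changes which code shape executes for a given trip count.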