An optimizing compiler has difficulty generating code that performs at top speed for an arbitrary data set size. In general, the low-level optimization process must take parameters such as the loop trip count into account to generate efficient code. The code can be specialized for different data set size ranges, at the expense of code expansion and decision-tree overhead. We propose a new method for specializing loop structures at the assembly level that drastically cuts this overhead through a new folding approach. Our technique can generate several versions, tuned for small, medium, and large iteration counts, and combine them sequentially at the assembly level. We first show on the SPEC benchmarks the need for specialization on small loops. We then demonstrate the benefit of our method on kernels, with detailed results.
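To make the baseline idea concrete, here is a minimal C sketch of classic source-level loop versioning by trip count, the technique whose dispatch overhead the paper's assembly-level folding aims to eliminate. The threshold `SMALL_TRIP` and the `sum` kernel are hypothetical illustrations, not taken from the paper:

```c
#include <stddef.h>

/* Hypothetical threshold separating the "small" and "large" versions. */
#define SMALL_TRIP 8

static double sum_small(const double *a, size_t n) {
    /* Version tuned for few iterations: a plain loop, so short trip
     * counts pay no unrolling prologue/epilogue overhead. */
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

static double sum_large(const double *a, size_t n) {
    /* Version tuned for many iterations: a 4-way unrolled main loop
     * with independent accumulators, plus a remainder loop. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)      /* handle the leftover iterations */
        s += a[i];
    return s;
}

double sum(const double *a, size_t n) {
    /* The decision tree: one run-time test selects the specialized
     * version. This test, multiplied across many versions, is the
     * overhead the paper's folding approach avoids. */
    return (n <= SMALL_TRIP) ? sum_small(a, n) : sum_large(a, n);
}
```

Both versions compute the same result; the specialization only changes which code shape executes for a given trip count.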