Classic loop unrolling improves the performance of sequential loops by reducing the overhead of the non-computational parts of the loop. Unfortunately, when the loop body contains parallelism, most compilers either ignore the loop or apply a naïve transformation. We propose extending the semantics of the loop unrolling transformation to cover loops that contain task parallelism. In these cases, the transformation tries to aggregate the multiple tasks that appear after a classic unrolling phase, reducing the per-iteration overhead. We present an implementation of this extended loop unrolling for OpenMP tasks in two phases: a classic unroll followed by a task aggregation phase. Our aggregation technique covers the special cases where task parallelism appears inside branches or where the loop is uncountable. Our experimental results show that extended unrolling lets loops with fine-grained tasks reduce the overhead associated with task creation and scale much better.
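To make the two phases concrete, the following is a minimal hand-written sketch of what unrolling followed by task aggregation could look like for a simple countable loop. It is not the authors' compiler output. The function `process`, the unroll factor of 4, and both function names are assumptions introduced only for illustration, and the sketch assumes the per-iteration tasks are independent of each other, so merging them into one task does not change the result.

```c
#include <stddef.h>

/* Hypothetical fine-grained work item; stands in for any small task body. */
static void process(double *a, size_t i) {
    a[i] = a[i] * 2.0 + 1.0;
}

/* Original loop: one OpenMP task is created per iteration. */
void run_original(double *a, size_t n) {
    #pragma omp parallel
    #pragma omp single
    {
        for (size_t i = 0; i < n; i++) {
            #pragma omp task firstprivate(a, i)
            process(a, i);
        }
    }
}

/* Sketch of the result after unrolling by 4 (an assumed factor) and
 * aggregating the four tasks of each unrolled iteration into one task.
 * Assumes the iterations are independent. */
void run_unrolled_aggregated(double *a, size_t n) {
    #pragma omp parallel
    #pragma omp single
    {
        size_t i = 0;
        /* Unrolled part: one task now covers four original iterations. */
        for (; i + 4 <= n; i += 4) {
            #pragma omp task firstprivate(a, i)
            {
                process(a, i);
                process(a, i + 1);
                process(a, i + 2);
                process(a, i + 3);
            }
        }
        /* Remainder loop: leftover iterations keep one task each. */
        for (; i < n; i++) {
            #pragma omp task firstprivate(a, i)
            process(a, i);
        }
    }
}
```

Both functions compute the same array contents; the aggregated version creates roughly a quarter as many tasks, which is where the reduction in task-creation overhead would come from. Compiled without OpenMP support, the pragmas are ignored and both versions run sequentially, which is convenient for checking that they agree.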