IEEE Transactions on Parallel and Distributed Systems
Time Stamp Algorithms for Runtime Parallelization of DOACROSS Loops with Dynamic Dependences
Due to the overhead of exploiting and managing parallelism, run-time loop parallelization techniques that aim to maximize parallelism may not necessarily lead to the best performance. In this paper, we present two parallelization techniques that exploit different degrees of parallelism for loops with dynamic cross-iteration dependences. The DOALL approach exploits iteration-level parallelism. It restructures the loop into a sequence of do-parallel loops separated by barrier operations; iterations of a do-parallel loop are run in parallel. By contrast, the DOACROSS approach exposes fine-grained reference-level parallelism. It allows dependent iterations to run concurrently by inserting point-to-point synchronization operations that preserve dependences. The DOACROSS approach has variants that identify different amounts of parallelism among consecutive reads of the same memory location. We evaluate the algorithms on symmetric multiprocessors for loops with various structures, memory access patterns, and computational workloads, scheduled using block-cyclic decomposition strategies. The experimental results show that the DOACROSS technique outperforms the DOALL technique, even though the latter is widely used in compile-time parallelization of loops. Of the DOACROSS variants, the algorithm allowing partially concurrent reads performs best because it incurs only slightly more overhead than the algorithm disallowing concurrent reads. The benefit of allowing fully concurrent reads is significant for small loops that do not have enough parallelism; however, it is likely to be outweighed by its cost for large loops or loops with a light workload.
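To make the contrast concrete, the following is a minimal sketch (not the paper's implementation) of the point-to-point synchronization idea behind the DOACROSS approach. It uses Python threads and one event flag per iteration: each iteration waits only for the specific predecessor it depends on, rather than for a global barrier as a DOALL restructuring would. The names `done`, `body`, and the recurrence `a[i] = a[i-1] + 1` are illustrative assumptions, standing in for a loop with a cross-iteration dependence.

```python
import threading

N = 8
a = [3] + [0] * N  # a[0] holds the initial value of the recurrence

# One event per iteration acts as a point-to-point synchronization flag:
# iteration i may execute its dependent statement only after iteration
# i-1 has signaled completion.
done = [threading.Event() for _ in range(N + 1)]
done[0].set()  # iteration 0 has no predecessor

def body(i):
    # Any independent work of iteration i could run here, fully in parallel.
    done[i - 1].wait()    # wait only for the iteration we depend on
    a[i] = a[i - 1] + 1   # dependent statement: a[i] reads a[i-1]
    done[i].set()         # signal successor iterations

threads = [threading.Thread(target=body, args=(i,)) for i in range(1, N + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A DOALL restructuring of the same loop would instead place a barrier between groups of independent iterations, idling all processors at each barrier; the point-to-point scheme above lets an iteration proceed as soon as its own dependence is satisfied.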