Loops with cross-iteration dependences (doacross loops) often contain significant amounts of parallelism that can potentially be exploited on modern manycore processors. However, most production-strength compilers focus their automatic parallelization efforts on doall loops, and consider doacross parallelism to be impractical due to the space inefficiencies and the synchronization overheads of past approaches. This paper presents a novel and practical approach to automatically parallelizing doacross loops for execution on manycore-SMP systems. We introduce a compiler-and-runtime optimization called dependence folding that bounds the number of synchronization variables allocated per worker thread (processor core) to be at most the maximum depth of a loop nest being considered for automatic parallelization. Our approach has been implemented in a development version of the IBM XL Fortran V13.1 commercial parallelizing compiler and runtime system. For four benchmarks where automatic doall parallelization was largely ineffective (speedups of under 2×), our implementation delivered speedups of 6.5×, 9.0×, 17.3×, and 17.5× on a 32-core IBM Power7 SMP system, thereby showing that doacross parallelization can be a valuable technique to complement doall parallelization.
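To make the idea of a doacross loop with a bounded number of synchronization variables concrete, the following is a minimal sketch in Python. It is not the paper's dependence folding algorithm or the XL Fortran runtime; it only illustrates the general pattern under stated assumptions: a single-level loop with the recurrence `a[i] = a[i-1] + b[i]`, iterations dealt cyclically to workers, and exactly one progress counter per worker (matching a loop nest of depth 1). The names `doacross_prefix_sum`, `progress`, and `num_workers` are hypothetical and chosen for illustration.

```python
import threading


def doacross_prefix_sum(b, num_workers=4):
    """Compute a[i] = a[i-1] + b[i] with a doacross-style schedule.

    Illustrative assumption: iteration i is owned by worker i % num_workers,
    and each worker exposes a single counter holding the highest iteration
    it has completed, instead of one flag per iteration.
    """
    n = len(b)
    a = [0] * n
    # One synchronization variable per worker (loop depth is 1 here).
    progress = [-1] * num_workers
    cond = threading.Condition()

    def worker(tid):
        # Each worker visits its own iterations in increasing order.
        for i in range(tid, n, num_workers):
            if i > 0:
                owner = (i - 1) % num_workers
                with cond:
                    # Wait: block until the owner of iteration i-1 has posted it.
                    cond.wait_for(lambda: progress[owner] >= i - 1)
            a[i] = (a[i - 1] if i > 0 else 0) + b[i]
            with cond:
                # Post: overwrite this worker's single counter and wake waiters.
                progress[tid] = i
                cond.notify_all()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


if __name__ == "__main__":
    print(doacross_prefix_sum([1, 2, 3, 4, 5], num_workers=3))
```

Because each worker completes its iterations in increasing order, overwriting one counter is enough for any waiter to tell whether the iteration it depends on has finished, so the synchronization storage grows with the number of workers rather than with the number of iterations. This toy recurrence is fully serial, so it demonstrates only the synchronization pattern, not a speedup; a real doacross loop would also contain work that is independent across iterations.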