Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors
ICS '99 Proceedings of the 13th international conference on Supercomputing
Space-efficient scheduling of nested parallelism
ACM Transactions on Programming Languages and Systems (TOPLAS)
OpenMP Extensions for Thread Groups and Their Run-Time Support
LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Performance Evaluation of OpenMP Applications with Nested Parallelism
LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing
A microbenchmark study of OpenMP overheads under nested parallelism
IWOMP'08 Proceedings of the 4th international conference on OpenMP in a new era of parallelism
Nested parallelism in the OMPI OpenmP/C compiler
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Variation-tolerant OpenMP tasking on tightly-coupled processor clusters
Proceedings of the Conference on Design, Automation and Test in Europe
Enabling fine-grained OpenMP tasking on tightly-coupled shared memory clusters
Proceedings of the Conference on Design, Automation and Test in Europe
ARTM: a lightweight fork-join framework for many-core embedded systems
Proceedings of the Conference on Design, Automation and Test in Europe
Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP
Proceedings of the First International Workshop on Many-core Embedded Systems
HARS: A hardware-assisted runtime software for embedded many-core architectures
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
Several recent many-core accelerators have been architected as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used -- with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level -- which make memory operations non-uniform (NUMA). Nested parallelism represents a powerful programming abstraction for these architectures, where a first level of parallelism can be used to distribute coarse-grained tasks to clusters, and additional levels of fine-grained parallelism can be distributed to processors within a cluster. This paper presents a lightweight and highly optimized support for nested parallelism on cluster-based embedded many-cores. We assess the costs to enable multi-level parallelization and demonstrate that our techniques allow to extract high degrees of parallelism.