Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Programming with POSIX threads
Programming with POSIX threads
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Proceedings of the ACM 2000 conference on Java Grande
OpenMP: parallel programming API for shared memory multiprocessors and on-chip multiprocessors
Proceedings of the 15th international symposium on System Synthesis
OpenMP: An Industry-Standard API for Shared-Memory Programming
IEEE Computational Science & Engineering
Intel® threading building blocks
Journal of Computing Sciences in Colleges
Proceedings of the Conference on Design, Automation and Test in Europe
Fast and lightweight support for nested parallelism on cluster-based embedded many-cores
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Proceedings of the Conference on Design, Automation and Test in Europe
HARS: A hardware-assisted runtime software for embedded many-core architectures
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
Embedded architectures are moving to multi-core and many-core concepts in order to sustain ever growing computing requirements within complexity and power budgets. Programming many-core architectures not only needs parallel programming skills, but also efficient exploitation of fine grain parallelism at both architecture and runtime levels. Scheduler reactivity is however increasingly important as tasks granularity is reduced, in order to keep the overhead of the scheduling to a minimum. This paper presents a lightweight fork-join framework for scheduling fine grain parallel tasks on embedded many-core systems. The asynchronous nature of the fork-join model used in this framework permits to dramatically decrease its scheduling overhead. Experimentation conducted in this paper show that the overhead induced by this framework is of 33 cycles per scheduled task. Also, we show that near-ideal speedup can be obtained by the ARTM framework for data parallel applications and that ARTM achieves better results than other state of the art parallelization techniques.