Approximate algorithms scheduling parallelizable tasks
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
Solution of a problem in concurrent programming control
Communications of the ACM
Non-blocking steal-half work queues
Proceedings of the twenty-first annual symposium on Principles of distributed computing
Dynamic circular work-stealing deque
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
A dynamic-sized nonblocking work stealing deque
Distributed Computing - Special issue: DISC 04
KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors
Proceedings of the 2007 international workshop on Parallel symbolic computation
Intel® threading building blocks
Journal of Computing Sciences in Colleges
On dynamic load balancing on graphics processors
Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Deque-Free Work-Optimal Parallel STL Algorithms
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Task management for irregular-parallel workloads on the GPU
Proceedings of the Conference on High Performance Graphics
Hi-index | 0.00 |
Graphics Processing units (GPU) have become a valuable support for High Performance Computing (HPC) applications. However, despite the many improvements of General Purpose GPUs, the current programming paradigms available, such as NVIDIA's CUDA, are still low-level and require strong programming effort, especially for irregular applications where dynamic load balancing is a key point to reach high performances. This paper introduces a new hybrid programming scheme for general purpose graphics processors using two levels of parallelism. In the upper level, a program creates, in a lazy fashion, tasks to be scheduled on the different Streaming Multiprocessors (MP), as defined in the NVIDIA's architecture. We have embedded inside GPU a well-known work stealing algorithm to dynamically balance the workload. At lower level, tasks exploit each Streaming Processor (SP) following a data-parallel approach. Preliminary comparisons on data-parallel iteration over vectors show that this approach is competitive on regular workload over the standard CUDA library Thrust, based on a static scheduling. Nevertheless, our approach outperforms Thrust-based scheduling on irregular workloads.