A new programming paradigm for GPGPU

Authors:
Julio Toss; Thierry Gautier
Affiliations:
Institute of Informatics, UFRGS, Porto Alegre, RS, Brazil; INRIA, MOAIS, LIG, Grenoble, France
Venue:
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Year:
2012

Abstract

Graphics Processing Units (GPUs) have become a valuable resource for High Performance Computing (HPC) applications. However, despite the many improvements in general-purpose GPUs, the available programming paradigms, such as NVIDIA's CUDA, remain low-level and demand considerable programming effort, especially for irregular applications, where dynamic load balancing is key to reaching high performance. This paper introduces a new hybrid programming scheme for general-purpose graphics processors that uses two levels of parallelism. At the upper level, a program lazily creates tasks to be scheduled on the different Streaming Multiprocessors (MP), as defined in NVIDIA's architecture. We have embedded a well-known work-stealing algorithm inside the GPU to balance the workload dynamically. At the lower level, tasks exploit each Streaming Processor (SP) following a data-parallel approach. Preliminary comparisons on data-parallel iteration over vectors show that, on regular workloads, this approach is competitive with Thrust, the standard CUDA library, which is based on static scheduling, and that it outperforms Thrust-based scheduling on irregular workloads.
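The two-level scheme described in the abstract can be illustrated with a minimal CPU-side sketch. It is not the authors' implementation: Python threads stand in for Streaming Multiprocessors, a plain inner loop stands in for the data-parallel work done by the Streaming Processors, and simple locks replace the non-blocking queues a GPU runtime would more likely use. The names `Range`, `pop_chunk`, `steal_half`, `parallel_for_each`, and the `grain` parameter are hypothetical and chosen only for illustration.

```python
import threading


class Range:
    """Half-open interval [begin, end) of vector indices owned by one worker."""

    def __init__(self, begin, end):
        self.begin, self.end = begin, end
        self.lock = threading.Lock()

    def pop_chunk(self, grain):
        # Owner side: lazily carve one task (at most `grain` indices) off the front.
        with self.lock:
            lo = self.begin
            hi = min(lo + grain, self.end)
            self.begin = hi
            return lo, hi

    def steal_half(self):
        # Thief side: take the back half of whatever the victim has left.
        with self.lock:
            mid = self.begin + (self.end - self.begin) // 2
            lo, hi = mid, self.end
            self.end = mid
            return lo, hi


def worker(me, ranges, data, out, f, grain):
    mine = ranges[me]
    while True:
        lo, hi = mine.pop_chunk(grain)
        if lo < hi:
            # Lower level: in the paper this part is data-parallel across SPs;
            # here it is an ordinary sequential loop.
            for i in range(lo, hi):
                out[i] = f(data[i])
            continue
        # Upper level: the local range is empty, so look for a victim.
        for victim in range(len(ranges)):
            if victim == me:
                continue
            lo, hi = ranges[victim].steal_half()
            if lo < hi:
                break
        else:
            return  # nothing left to steal anywhere
        with mine.lock:
            mine.begin, mine.end = lo, hi


def parallel_for_each(data, f, workers=4, grain=8):
    """Apply f to every element of data, balancing the load by work stealing."""
    n = len(data)
    out = [None] * n
    # Start from an even static split; stealing corrects any imbalance later.
    bounds = [n * w // workers for w in range(workers + 1)]
    ranges = [Range(bounds[w], bounds[w + 1]) for w in range(workers)]
    threads = [
        threading.Thread(target=worker, args=(w, ranges, data, out, f, grain))
        for w in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

A call such as `parallel_for_each(values, expensive_function, workers=4)` starts from the same even split a static scheduler would use, but a worker that finishes early takes half of another worker's remaining indices instead of idling, which is the behavior that matters on irregular workloads.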