Nested data-parallelism on the GPU

  • Authors:
  • Lars Bergstrom; John Reppy

  • Affiliations:
  • University of Chicago, Chicago, IL, USA; University of Chicago, Chicago, IL, USA

  • Venue:
  • Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP)
  • Year:
  • 2012

Abstract

Graphics processing units (GPUs) provide far greater memory bandwidth and arithmetic performance than CPUs but, because of their Single-Instruction-Multiple-Data (SIMD) architecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-level parallelism, performing only operations that apply uniformly over vectors. NESL is a first-order functional language that was designed to allow programmers to write irregular-parallel programs, such as parallel divide-and-conquer algorithms, for wide-vector parallel computers. This paper presents our port of the NESL implementation to GPUs and provides empirical evidence that nested data-parallelism (NDP) on GPUs significantly outperforms CPU-based implementations and matches or beats newer GPU languages that support only flat parallelism. While our performance does not match that of hand-tuned CUDA programs, we argue that the notational conciseness of NESL is worth the loss in performance. This work provides the first language implementation that directly supports NDP on a GPU.
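
To make the distinction between flat and nested data-parallelism concrete, the sketch below (not taken from the paper, and written in Haskell rather than NESL) expresses a sparse matrix-vector product, a classic irregular computation in which the amount of work per row varies. Both the outer loop over rows and the inner reduction over a row's nonzeros are conceptually parallel; this nesting of parallel operations inside parallel operations is exactly what flat data-parallel models cannot express directly and what NESL-style flattening turns into uniform vector operations.

  -- A minimal Haskell sketch (not the paper's NESL code) of an irregular,
  -- nested data-parallel computation: sparse matrix-vector multiplication.
  -- Each row is a list of (column index, value) pairs of varying length,
  -- so the inner "parallel loop" has a different trip count per row.

  type SparseRow    = [(Int, Double)]   -- (column, value) pairs for the nonzeros
  type SparseMatrix = [SparseRow]       -- rows may have different lengths

  -- Conceptually, both the outer map (over rows) and the inner sum
  -- (over a row's nonzeros) are parallel operations.
  sparseMatVec :: SparseMatrix -> [Double] -> [Double]
  sparseMatVec m v = map dotRow m
    where
      dotRow row = sum [x * (v !! i) | (i, x) <- row]

  main :: IO ()
  main = print (sparseMatVec [[(0, 2.0)], [(0, 1.0), (2, 3.0)]] [1.0, 4.0, 5.0])
  -- prints [2.0,16.0]

Written in NESL, the same computation would use nested apply-to-each comprehensions, and the implementation described in the abstract is responsible for flattening that nesting so it maps onto the GPU's wide SIMD hardware.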