In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs. In single-thread performance, applications tested with our tool often reached parity with compiled C versions of the same applications. With the modest coarse-grained multithreading typical of today's CPU architectures, we observed an average speedup of 3.4x on 4 processors across the test applications.