In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs. In single-thread performance, applications tested with our tool often reached parity with compiled C versions of the same applications. With the modest coarse-grained multithreading typical of today's CPU architectures, we observed an average speedup of 3.4x on 4 processors across the test applications.