Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Binary Translation: Static, Dynamic, Retargetable?
ICSM '96 Proceedings of the 1996 International Conference on Software Maintenance
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Heterogeneous Chip Multiprocessors
Computer
EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Merge: a programming model for heterogeneous multi-core systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
Languages and Compilers for Parallel Computing
Programming model for a heterogeneous x86 platform
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
A characterization and analysis of PTX kernels
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
IEEE Micro
State-of-the-art in heterogeneous computing
Scientific Programming
Supporting speculative parallelization in the presence of dynamic data structures
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 24th ACM International Conference on Supercomputing
Understanding throughput-oriented architectures
Communications of the ACM
An OpenCL framework for heterogeneous multicores with local memory
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08 $$\times $$ 脳 in the best case and 1.42 $$\times $$ 脳 on average compared to the baseline GPU-only processing.