CPU-assisted GPGPU on fused CPU-GPU architectures

Authors:
Yi Yang;Ping Xiang;Mike Mantor;Huiyang Zhou
Affiliations:
Department of Electrical and Computer Engineering, North Carolina State University;Department of Electrical and Computer Engineering, North Carolina State University;Graphics Products Group, Advanced Micro Devices;Department of Electrical and Computer Engineering, North Carolina State University
Venue:
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Year:
2012

Citing 0
Cited 6

Efficient scheduling of recursive control flow on GPUs

Proceedings of the 27th international ACM conference on International conference on supercomputing
Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Design and implementation of the fusion simulator based on multi-shader GPU

Proceedings of the 2013 Research in Adaptive and Convergent Systems
Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture

Journal of Parallel and Distributed Computing
Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

Proceedings of Workshop on General Purpose Processing Using GPUs

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains memory access instructions of the GPU kernel for multiple thread-blocks. The CPU pre-execution program runs ahead of GPU threads because (1) the CPU pre-execution thread only contains memory fetch instructions from GPU kernels and not floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2-cache on the CPU side to increase the memory traffic from CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency can be drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. Our experiments on a set of benchmarks show that our proposed pre-execution improves the performance by up to 113% and 21.4% on average.