Graphics Processing Units (GPUs) have rapidly emerged as general-purpose coprocessors in high-performance computing applications since the launch of the Compute Unified Device Architecture (CUDA). However, they suffer from an inherent performance bottleneck: communication between two separate address spaces (the main memory of the CPU and the memory of the GPU) is unavoidable. The CUDA Application Programming Interface (API) provides asynchronous transfers and streams, which permit staged execution and thereby allow communication and computation to be overlapped. Nevertheless, there is no precise way to estimate the improvement that overlapping can yield, nor a rule to determine the optimal number of stages or streams into which the computation should be divided. In this work, we present a methodology for modeling the performance of asynchronous data transfers with CUDA streams on different GPU architectures. We illustrate this methodology by deriving performance expressions for two consumer graphics architectures from recent generations. These models allow programmers to estimate the optimal number of streams into which the GPU computation should be broken up in order to obtain the highest performance improvement. Finally, we have validated our performance models on three applications based on codes from the CUDA Software Development Kit (SDK), with successful results.
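To make the staging pattern the abstract refers to concrete, the following is a minimal sketch of overlapping host-device transfers with kernel execution via CUDA streams. It is not taken from the paper: the kernel, sizes, and the choice of four streams are illustrative (the paper's models are precisely what would guide that choice), and the example assumes the number of streams evenly divides the problem size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel standing in for the real computation.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 22;          // total number of elements
    const int NSTREAMS = 4;         // stage/stream count (the quantity the models estimate)
    const int CHUNK = N / NSTREAMS; // elements per stage (assumes NSTREAMS divides N)

    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float)); // pinned host memory, required for async copies
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[NSTREAMS];
    for (int k = 0; k < NSTREAMS; ++k) cudaStreamCreate(&s[k]);

    // Staged execution: each chunk's copy-in, kernel, and copy-out are issued
    // in its own stream, so stage k's kernel can overlap stage k+1's transfer.
    for (int k = 0; k < NSTREAMS; ++k) {
        int off = k * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[k]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %f\n", h[0]); // expect 2.0
    for (int k = 0; k < NSTREAMS; ++k) cudaStreamDestroy(s[k]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

With a single stream, the transfer and kernel phases serialize; splitting the work as above pipelines them, and the performance models described in the abstract aim to predict the value of NSTREAMS that maximizes this overlap on a given architecture.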