CudaDMA: optimizing GPU memory bandwidth via warp specialization

Authors:
Michael Bauer;Henry Cook;Brucek Khailany
Affiliations:
Stanford University;UC Berkeley;NVIDIA Research
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Citing 12
Cited 12

The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A bandwidth-efficient architecture for media processing

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Cell broadband engine architecture and its first implementation: a performance view

IBM Journal of Research and Development
Roofline: an insightful visual performance model for multicore architectures

Communications of the ACM - A Direct Path to Dependable Software
3D finite difference computation on GPUs using CUDA

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Revisiting sorting for GPGPU stream architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Programming the memory hierarchy revisited: supporting irregular parallelism in sequoia

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
A domain-specific approach to heterogeneous parallelism

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Copperhead: compiling an embedded data parallel language

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Accelerating CUDA graph algorithms at maximum warp

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Accelerating GPU kernels for dense linear algebra

VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science

GPURoofline: a model for guiding performance optimizations on GPUs

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An insightful program performance tuning chain for GPU computing

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Memory reuse optimizations in the R-Stream compiler

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Scaling large-data computations on multi-GPU accelerators

Proceedings of the 27th international ACM conference on International conference on supercomputing
Optimized GPU implementation and performance analysis of HC series of stream ciphers

ICISC'12 Proceedings of the 15th international conference on Information Security and Cryptology
Orchestrated scheduling and prefetching for GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
GPUWattch: enabling energy optimizations in GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
Portable and Transparent Host-Device Communication Optimization for GPGPU Environments

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hardware acceleration of database operations

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Singe: leveraging warp specialization for high performance on GPUs

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

As the computational power of GPUs continues to scale with Moore's Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedup of up to 1.37x on representative synthetic microbenchmarks, and 1.15x-3.2x on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.