Shared memory multiplexing: a novel way to improve GPGPU throughput

Authors:
Yi Yang;Ping Xiang;Mike Mantor;Norm Rubin;Huiyang Zhou
Affiliations:
North Carolina State University, Raleigh, NC, USA;North Carolina State University, Raleigh, NC, USA;AMD Inc., Orlando, FL, USA;AMD Inc., Orlando, FL, USA;North Carolina State University, Raleigh, NC, USA
Venue:
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Year:
2012

Abstract

On-chip shared memory (a.k.a. local data share) is a critical resource for many GPGPU applications. In current GPUs, shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and is released only when the thread block completes. As a result, the limited capacity of shared memory becomes a bottleneck that restricts how many thread blocks a GPU can host, limiting the otherwise available thread-level parallelism (TLP). In this paper, we propose software and hardware approaches to multiplex the shared memory among multiple thread blocks. Our approaches are based on the observation that current shared memory management is too conservative: it reserves shared memory for the entire lifetime of a thread block. If the shared memory is allocated only when it is actually used and freed immediately afterwards, more thread blocks can be hosted in an SM without increasing the shared memory capacity. We propose three software approaches to enable shared memory multiplexing and implement them using a source-to-source compiler. The experimental results show that our software approaches effectively improve the throughput of many GPGPU applications on both NVIDIA GTX285 and GTX480 GPUs (an average of 1.44X on GTX285, 1.70X on GTX480 with 16kB shared memory, and 1.26X on GTX480 with 48kB shared memory). We also propose hardware support for shared memory multiplexing, which requires only minor changes to existing hardware and enables significant performance improvements (an average of 1.53X) with very little change to GPGPU code.
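The capacity argument in the abstract can be illustrated with a small piece of arithmetic. The sketch below is not taken from the paper: the function name, the per-block shared memory footprint, the per-SM block cap, and the sharing factor are all hypothetical values chosen for illustration, and other limits such as registers and thread counts are ignored. It only counts how many thread blocks could be resident on one SM, and says nothing about the resulting speedup.

```python
def resident_blocks(smem_per_sm, smem_per_block, max_blocks_per_sm):
    """Thread blocks one SM can host when shared memory is the only limit considered."""
    if smem_per_block == 0:
        return max_blocks_per_sm
    return min(max_blocks_per_sm, smem_per_sm // smem_per_block)


# 16 kB of shared memory per SM, matching one configuration named in the abstract.
SMEM_PER_SM = 16 * 1024
# Assumed values, not from the source: a cap of 8 blocks per SM and a
# kernel that declares 8 kB of shared memory per thread block.
MAX_BLOCKS_PER_SM = 8
SMEM_PER_BLOCK = 8 * 1024

# Conventional management: each block holds its shared memory for its whole lifetime.
baseline = resident_blocks(SMEM_PER_SM, SMEM_PER_BLOCK, MAX_BLOCKS_PER_SM)

# Multiplexing, under the assumption that 2 blocks can take turns on one
# shared memory region because each needs it only for part of its lifetime.
SHARING_FACTOR = 2
multiplexed = min(MAX_BLOCKS_PER_SM, baseline * SHARING_FACTOR)

print(baseline, multiplexed)
```

Under these assumed numbers the SM hosts 2 blocks with conventional allocation and 4 when two blocks share each region. How much sharing is achievable in practice depends on how long each block actually holds the shared memory.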