Elastic pipeline: addressing GPU on-chip shared memory bank conflicts

Authors:
Chunyang Gou; Georgi N. Gaydadjiev
Affiliations:
Delft University of Technology, The Netherlands; Delft University of Technology, The Netherlands
Venue:
Proceedings of the 8th ACM International Conference on Computing Frontiers
Year:
2011

Abstract

Bank conflicts are one of the major problems with GPU on-chip shared memory. We observe that the throughput of the GPU processor core is often constrained neither by shared memory bandwidth nor by shared memory latency (as long as that latency stays constant), but rather by the variable latencies that memory bank conflicts introduce. These variable latencies cause conflicts at the writeback stage of the in-order pipeline, which lead to pipeline stalls and degrade system throughput. Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that the proposed elastic pipeline, together with a co-designed bank-conflict-aware warp scheduling scheme, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for our benchmark applications, with trivial hardware overhead.
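
To make the notion of a bank conflict concrete, the following minimal sketch counts how many serialized accesses one warp-wide shared memory request would need. It is an illustration only and is not taken from the paper: the bank count (32), the 4-byte word interleaving, the warp size (32), the assumption that threads reading the same word are served together, and the names `conflict_degree` and `strided_addresses` are all assumptions chosen for the example.

```python
from collections import defaultdict


def conflict_degree(addresses, num_banks=32, word_bytes=4):
    """Return the largest number of distinct words that map to a single bank.

    A degree of 1 means the request is conflict-free; a degree of k means
    the request would be serialized into k bank accesses under the assumed
    word-interleaved layout. All parameters are illustrative assumptions.
    """
    words_per_bank = defaultdict(set)
    for addr in addresses:
        word = addr // word_bytes
        # Threads that touch the same word are assumed to be served together,
        # so only distinct words within a bank are counted.
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided_addresses(stride_words, num_threads=32, word_bytes=4):
    """Byte addresses for threads 0..num_threads-1 with a fixed word stride."""
    return [t * stride_words * word_bytes for t in range(num_threads)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        degree = conflict_degree(strided_addresses(stride))
        print(f"stride {stride:2d}: conflict degree {degree}")
```

Under these assumptions, strides 1 and 33 are conflict-free, stride 2 gives a two-way conflict, and stride 32 sends every thread to the same bank. The conflict degree therefore differs from one request to the next, which is the kind of variable shared memory latency the abstract identifies as the source of writeback-stage conflicts.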