A processor architecture for horizon
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Conflict-Free Vector Access Using a Dynamic Storage Scheme
IEEE Transactions on Computers
IEEE Transactions on Computers
Proceedings of the 27th annual international symposium on Computer architecture
Advanced ASIC Chip Synthesis: Using Synopsys Design Compiler Physical Compiler and Prime Time
Advanced ASIC Chip Synthesis: Using Synopsys Design Compiler Physical Compiler and Prime Time
Conflict-Free Access for Streams in Multimodule Memories
IEEE Transactions on Computers
Block, Multistride Vector, and FFT Accesses in Parallel Memory Systems
IEEE Transactions on Parallel and Distributed Systems
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Sams: single-affiliation multiple-stride parallel memory scheme
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
An asymmetric distributed shared memory model for heterogeneous parallel systems
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
SAMS multi-layout memory: providing multiple views of data to boost SIMD performance
Proceedings of the 24th ACM International Conference on Supercomputing
Programming Massively Parallel Processors: A Hands-on Approach
Programming Massively Parallel Processors: A Hands-on Approach
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Hi-index | 0.00 |
One of the major problems with the GPU on-chip shared memory is bank conflicts. We observed that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth, nor by the shared memory latency (as long as it stays constant), but is rather due to the varied latencies caused by memory bank conflicts. This results in conflicts at the writeback stage of the in-order pipeline and pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput, by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed elastic pipeline together with the co-designed bank-conflict aware warp scheduling reduces the pipeline stalls by up to 64.0% (with 42.3% on average) and improves the overall performance by up to 20.7% (on average 13.3%) for our benchmark applications, at trivial hardware overhead.