OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Orchestrated scheduling and prefetching for GPGPUs
Proceedings of the 40th Annual International Symposium on Computer Architecture
Hi-index | 0.01 |
The presence of multiple MCs and their integration into the on-chip network fabric creates a highly concurrent system that can support significant levels of memory level parallelism (MLP) across cores. This work exposes the trade-off between DRAM parameters, bank level parallelism (BLP), and row buffer hit rate that exposes the amount of effective BLP that is necessary to approximate a 100% hit rate. We further study how this trade-off can be controlled and propose a class of global (system) and local (within an MC) address mappings that can be tuned to optimize the performance across a set of multiprogrammed benchmarks.