DRAM systems achieve high performance when all DRAM banks are busy servicing useful memory requests. The degree to which the banks are kept busy is called DRAM Bank-Level Parallelism (BLP). This paper proposes two new cost-effective mechanisms to maximize DRAM BLP. BLP-Aware Prefetch Issue (BAPI) issues prefetches into the on-chip Miss Status Holding Registers (MSHRs) associated with each core in a multi-core system so that the requests can be serviced in parallel in different DRAM banks. BLP-Preserving Multi-core Request Issue (BPMRI) fills the DRAM controller's request buffers so that requests from the same core can be serviced in parallel, minimizing the serialization of each core's concurrent requests. When combined, BAPI and BPMRI improve system performance by 11.7% on a 4-core CMP system across a wide variety of multiprogrammed workloads. BAPI and BPMRI also complement various existing DRAM scheduling and prefetching algorithms and can be used in conjunction with them.
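As a rough illustration of the idea behind BLP-aware prefetch issue, the following Python sketch picks queued prefetches so that the requests held in a core's MSHRs spread across as many DRAM banks as possible. It is not the paper's algorithm: the address-to-bank mapping, the bank count, the row size, the least-loaded-bank heuristic, and all function names (`bank_of`, `pick_prefetch`, `issue_prefetches`) are assumptions made for this example only.

```python
from collections import Counter

NUM_BANKS = 8  # assumed bank count, for illustration only


def bank_of(addr, num_banks=NUM_BANKS, row_bytes=8192):
    """Hypothetical address mapping: consecutive rows interleave across banks."""
    return (addr // row_bytes) % num_banks


def pick_prefetch(prefetch_queue, mshr_addrs, num_banks=NUM_BANKS):
    """Return the queued prefetch whose bank has the fewest requests in the MSHRs."""
    if not prefetch_queue:
        return None
    # Count in-flight requests per bank; missing banks count as zero.
    load = Counter(bank_of(a, num_banks) for a in mshr_addrs)
    # min() returns the earliest queued candidate on ties (FIFO tie-break).
    return min(prefetch_queue, key=lambda a: load[bank_of(a, num_banks)])


def issue_prefetches(prefetch_queue, mshr_addrs, free_mshrs, num_banks=NUM_BANKS):
    """Fill up to free_mshrs MSHR entries, re-evaluating bank load after each issue."""
    queue = list(prefetch_queue)
    mshrs = list(mshr_addrs)
    issued = []
    while queue and len(issued) < free_mshrs:
        choice = pick_prefetch(queue, mshrs, num_banks)
        queue.remove(choice)
        mshrs.append(choice)  # the issued prefetch now occupies an MSHR
        issued.append(choice)
    return issued


if __name__ == "__main__":
    in_flight = [0, 64]                  # both map to bank 0
    queued = [128, 8192, 8256, 16384]    # banks 0, 1, 1, 2
    print(issue_prefetches(queued, in_flight, free_mshrs=2))
```

With two free MSHR entries, this sketch skips the prefetch to the already busy bank 0 and the second prefetch to bank 1, and issues one request each to banks 1 and 2. A plain FIFO policy would instead have issued the bank 0 prefetch first, serializing it behind the two in-flight requests.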