BPM/BPM+: Software-based dynamic memory partitioning mechanisms for mitigating DRAM bank-/channel-level interferences in multicore systems

Authors:
Lei Liu;Zehan Cui;Yong Li;Yungang Bao;Mingyu Chen;Chengyong Wu
Affiliations:
State Key Laboratory of Computer Architecture, ICT, CAS, China;State Key Laboratory of Computer Architecture, ICT, CAS, China/ Graduate School, CAS, China;Department of ECE, University of Pittsburgh;State Key Laboratory of Computer Architecture, ICT, CAS, China;State Key Laboratory of Computer Architecture, ICT, CAS, China;State Key Laboratory of Computer Architecture, ICT, CAS, China
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2014

Citing 39
Cited 0

Optimal Partitioning of Cache Memory

IEEE Transactions on Computers
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
OS-Controlled Cache Predictability for Real-Time Systems

RTAS '97 Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium (RTAS '97)
A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Dynamic Partitioning of Shared Cache Memory

The Journal of Supercomputing
A study of performance impact of memory controller features in multi-processor server environment

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Fair Queuing Memory Systems

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
QoS policies and architecture for cache/memory in CMP platforms

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Memory scheduling for modern microprocessors

ACM Transactions on Computer Systems (TOCS)
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Memory performance attacks: denial of memory service in multi-core systems

SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
HMTT: a platform independent full-system memory trace monitoring system

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Using OS Observations to Improve Performance in Multicore Systems

IEEE Micro
Enhancing operating system support for multicore processors by using hardware performance monitoring

ACM SIGOPS Operating Systems Review
Scaling the bandwidth wall: challenges in and avenues for CMP scaling

Proceedings of the 36th annual international symposium on Computer architecture
vGreen: a system for energy efficient computing in virtualized environments

Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Complexity effective memory access scheduling for many-core accelerator architectures

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Improving memory bank-level parallelism in the presence of prefetching

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Addressing shared resource contention in multicore processors via scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Micro-pages: increasing DRAM efficiency with locality-aware data placement

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Re-architecting DRAM memory systems with monolithically integrated silicon photonics

Proceedings of the 37th annual international symposium on Computer architecture
Rethinking DRAM design and organization for energy-constrained multi-cores

Proceedings of the 37th annual international symposium on Computer architecture
Hardware execution throttling for multi-core resource management

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Software-hardware cooperative DRAM bank partitioning for chip multiprocessors

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Rapid identification of architectural bottlenecks via precise event counting

Proceedings of the 38th annual international symposium on Computer architecture
Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing memory interference in multicore systems via application-aware memory channel partitioning

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Balancing DRAM locality and parallelism in shared memory CMP systems

HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
A software memory partition approach for eliminating bank-level interference in multicore systems

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Regularities considered harmful: forcing randomness to memory accesses to reduce row buffer conflicts for multi-core, multi-bank systems

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Application-to-core mapping policies to reduce memory system interference in multi-core systems

HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
MISE: Providing performance predictability and improving fairness in shared main memory systems

HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
Coordinated Bank and Cache Coloring for Temporal Protection of Memory Accesses

CSE '13 Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

The main memory system is a shared resource in modern multicore machines that can result in serious interference leading to reduced throughput and unfairness. Many new memory scheduling mechanisms have been proposed to address the interference problem. However, these mechanisms usually employ relative complex scheduling logic and need modifications to Memory Controllers (MCs), which incur expensive hardware design and manufacturing overheads. This article presents a practical software approach to effectively eliminate the interference without any hardware modifications. The key idea is to modify the OS memory management system and adopt a page-coloring-based Bank-level Partitioning Mechanism (BPM) that allocates dedicated DRAM banks to each core (or thread). By using BPM, memory requests from distinct programs are segregated across multiple memory banks to promote locality/fairness and reduce interference. We further extend BPM to BPM+ by incorporating channel-level partitioning, on which we demonstrate additional gain over BPM in many cases. To achieve benefits in the presence of diverse application memory needs and avoid performance degradation due to resource underutilization, we propose a dynamic mechanism upon BPM/BPM+ that assigns appropriate bank/channel resources based on application memory/bandwidth demands monitored through PMU (performance-monitoring unit) and a low-overhead OS page table scanning process. We implement BPM/BPM+ in Linux 2.6.32.15 kernel and evaluate the technique on four-core and eight-core real machines by running a large amount of randomly generated multiprogrammed and multithreaded workloads. Experimental results show that BPM/BPM+ can improve the overall system throughput by 4.7%/5.9%, on average, (up to 8.6%/9.5%) and reduce the unfairness by an average of 4.2%/6.1% (up to 15.8%/13.9%).