Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Authors:
Yoongu Kim; Michael Papamichael; Onur Mutlu; Mor Harchol-Balter
Venue:
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2010

Abstract

In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory Scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput, 2) we introduce a "niceness" metric that captures a thread's propensity to interfere with other threads, and 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such "light" threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved.
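The two-cluster idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify how memory intensity is measured, how the cluster boundary is chosen, or how niceness is computed. The `mpki` and `bandwidth` fields, the `thresh` parameter, and both function names are hypothetical, and a plain rotation stands in for the niceness-based shuffle.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    mpki: float       # assumed intensity metric (misses per kilo-instruction)
    bandwidth: float  # assumed bandwidth consumed in the last interval


def cluster(threads, thresh=0.2):
    """Split threads into (latency_sensitive, bandwidth_sensitive).

    Assumption: the lightest threads join the latency-sensitive cluster
    until their combined bandwidth would exceed `thresh` of the total.
    """
    total = sum(t.bandwidth for t in threads)
    light, heavy = [], []
    used = 0.0
    for t in sorted(threads, key=lambda t: t.mpki):
        if not heavy and used + t.bandwidth <= thresh * total:
            light.append(t)
            used += t.bandwidth
        else:
            heavy.append(t)
    return light, heavy


def priority_order(light, heavy, epoch):
    """Return thread names, highest priority first.

    The latency-sensitive cluster always ranks above the
    bandwidth-sensitive one. The bandwidth-sensitive cluster is rotated
    once per epoch; this rotation is a stand-in, NOT the niceness-aware
    shuffle the abstract refers to.
    """
    if heavy:
        k = epoch % len(heavy)
        heavy = heavy[k:] + heavy[:k]
    return [t.name for t in light + heavy]


threads = [
    Thread("a", 0.5, 1.0),
    Thread("b", 1.0, 1.0),
    Thread("c", 20.0, 40.0),
    Thread("d", 30.0, 58.0),
]
light, heavy = cluster(threads)
for epoch in range(3):
    print(epoch, priority_order(light, heavy, epoch))
```

In this sketch the light threads `a` and `b` keep top priority in every epoch, while `c` and `d` take turns at the head of the bandwidth-sensitive cluster, so neither is permanently deprioritized.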
We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and the best fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput by 4.6% and reduces maximum slowdown by 38.6% compared to ATLAS (the previous work providing the best system throughput), and by 7.6% and 4.6%, respectively, compared to PAR-BS (the previous work providing the best fairness).
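The abstract does not define its two metrics. The sketch below assumes definitions commonly used in multiprogrammed-workload studies: system throughput as weighted speedup (the sum of each thread's shared-run IPC divided by its alone-run IPC) and fairness as maximum slowdown (the largest alone-to-shared IPC ratio across threads). The function names and the sample IPC values are illustrative and are not taken from the paper.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Assumed throughput metric: sum of per-thread IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def maximum_slowdown(ipc_shared, ipc_alone):
    """Assumed fairness metric: largest per-thread IPC_alone / IPC_shared."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))


# Illustrative values only: two threads, IPC when run alone vs. when sharing memory.
ipc_alone = [2.0, 1.0]
ipc_shared = [1.0, 0.25]
print(weighted_speedup(ipc_shared, ipc_alone))
print(maximum_slowdown(ipc_shared, ipc_alone))
```

Under these definitions a higher weighted speedup is better, while a lower maximum slowdown is better, which is why the abstract reports one as an improvement and the other as a reduction.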