Staged memory scheduling: achieving high performance and scalability in heterogeneous systems

Authors:
Rachata Ausavarungnirun;Kevin Kai-Wei Chang;Lavanya Subramanian;Gabriel H. Loh;Onur Mutlu
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Advanced Micro Devices, Inc.;Carnegie Mellon University
Venue:
Proceedings of the 39th Annual International Symposium on Computer Architecture
Year:
2012

Citing 28
Cited 10

Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
Dynamic Access Ordering for Streamed Computations

IEEE Transactions on Computers
Symbiotic jobscheduling for a simultaneous multithreaded processor

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Fair Queuing Memory Systems

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Effective Management of DRAM Bandwidth in Multicore Processors

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Memory performance attacks: denial of memory service in multi-core systems

SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
NVIDIA Tesla: A Unified Graphics and Computing Architecture

IEEE Micro
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
System-Level Performance Metrics for Multiprogram Workloads

IEEE Micro
Complexity effective memory access scheduling for many-core accelerator architectures

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Application-aware prioritization mechanisms for on-chip networks

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Bobcat: AMD's Low-Power x86 Processor

IEEE Micro
Fairness Metrics for Multi-Threaded Processors

IEEE Computer Architecture Letters
PTask: operating system abstractions to manage GPUs as compute devices

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Parallel application memory scheduling

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing memory interference in multicore systems via application-aware memory channel partitioning

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC

Proceedings of the 49th Annual Design Automation Conference
DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function

IEEE Computer Architecture Letters

A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC

Proceedings of the 49th Annual Design Automation Conference
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Conservative row activation to improve memory power efficiency

Proceedings of the 27th international ACM conference on International conference on supercomputing
Orchestrated scheduling and prefetching for GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
A heterogeneous multiple network-on-chip design: an application-aware approach

Proceedings of the 50th Annual Design Automation Conference
Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Neither more nor less: optimizing thread-level parallelism for GPGPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Reshaping cache misses to improve row-buffer locality in multicore systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture

Journal of Parallel and Distributed Computing
Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Proceedings of Workshop on General Purpose Processing Using GPUs

Quantified Score

Hi-index	0.00

Visualization

Abstract

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.