Scaling the bandwidth wall: challenges in and avenues for CMP scaling

Authors:
Brian M. Rogers; Anil Krishna; Gordon B. Bell; Ken Vu; Xiaowei Jiang; Yan Solihin
Affiliations:
North Carolina State University, Raleigh, NC, USA; IBM, Research Triangle Park, NC, USA; IBM, Research Triangle Park, NC, USA; IBM, Research Triangle Park, NC, USA; North Carolina State University, Raleigh, NC, USA; North Carolina State University, Raleigh, NC, USA
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009

Abstract

As transistor density continues to grow at an exponential rate in accordance with Moore's law, the goal for many Chip Multi-Processor (CMP) systems is to scale the number of on-chip cores proportionally. Unfortunately, off-chip memory bandwidth is projected to grow slowly compared to the desired growth in the number of cores. As a result, each core will have a decreasing share of off-chip bandwidth with which to load its data from off-chip memory. The situation in which off-chip bandwidth becomes a performance and throughput bottleneck is referred to as the bandwidth wall problem. In this study, we seek to answer two questions: (1) to what extent does the bandwidth wall problem restrict future multicore scaling, and (2) to what extent are various bandwidth conservation techniques able to mitigate this problem. To answer them, we develop a simple but powerful analytical model to predict the number of on-chip cores that a CMP can support given a limited growth in memory traffic capacity. We find that the bandwidth wall can severely limit core scaling. Starting with a balanced 8-core CMP and without increasing the memory traffic requirement, the number of cores can only scale to 24 in four technology generations, as opposed to 128 cores under proportional scaling. We find that the individual bandwidth conservation techniques we evaluate have a wide-ranging impact on core scaling, and that when combined, these techniques have the potential to enable super-proportional core scaling for up to four technology generations.
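
To make the core-scaling arithmetic concrete, the following is a minimal Python sketch of one way such an analytical model could be set up. It is not taken from the abstract. Everything beyond the quoted numbers is an assumption made for illustration: that per-core miss traffic follows a power law in cache size per core with exponent `alpha = 0.5`, that the balanced baseline die splits its area equally between 8 cores and 8 core-sized units of cache, that die area doubles every technology generation, and that the memory traffic budget stays flat. All function and parameter names are hypothetical.

```python
def traffic_ratio(cores, total_area, base_cores=8, base_cache=8.0, alpha=0.5):
    """Memory traffic of a candidate design relative to the baseline.

    Assumption: per-core traffic scales as (cache per core) ** -alpha,
    and area is measured in units of one core.
    """
    cache = total_area - cores          # area left over for cache
    if cache <= 0:
        return float("inf")             # no cache at all: treat as infeasible
    cache_per_core = cache / cores
    base_cache_per_core = base_cache / base_cores
    per_core = (cache_per_core / base_cache_per_core) ** (-alpha)
    return (cores / base_cores) * per_core


def max_cores(generations, base_cores=8, base_cache=8.0, alpha=0.5,
              bandwidth_growth=1.0):
    """Largest core count whose traffic stays within the bandwidth budget.

    Assumption: die area doubles each generation; bandwidth_growth = 1.0
    means the memory traffic budget does not grow at all.
    """
    total_area = (base_cores + base_cache) * 2 ** generations
    best = 0
    for cores in range(1, int(total_area)):
        ratio = traffic_ratio(cores, total_area, base_cores, base_cache, alpha)
        if ratio <= bandwidth_growth:
            best = cores                # ratio rises with cores, so keep the last fit
    return best


if __name__ == "__main__":
    for g in range(5):
        proportional = 8 * 2 ** g
        print(f"generation {g}: proportional={proportional}, "
              f"bandwidth-limited={max_cores(g)}")
```

Under these assumed parameters the sketch gives 24 cores after four generations, against 8 × 2^4 = 128 under proportional scaling. Other values of `alpha` or `bandwidth_growth` would shift that result.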