As transistor density continues to grow at an exponential rate in accordance with Moore's law, the goal for many Chip Multi-Processor (CMP) systems is to scale the number of on-chip cores proportionally. Unfortunately, off-chip memory bandwidth is projected to grow slowly compared to the desired growth in the number of cores. As a result, each core will have a decreasing share of off-chip bandwidth with which to load its data from off-chip memory. The situation in which off-chip bandwidth becomes a performance and throughput bottleneck is referred to as the bandwidth wall problem. In this study, we seek to answer two questions: (1) to what extent does the bandwidth wall problem restrict future multicore scaling, and (2) to what extent can various bandwidth conservation techniques mitigate this problem. To answer these questions, we develop a simple but powerful analytical model that predicts the number of on-chip cores a CMP can support given a limited growth in memory traffic capacity. We find that the bandwidth wall can severely limit core scaling. Starting from a balanced 8-core CMP and holding the memory traffic requirement constant, the number of cores can scale to only 24 in four technology generations, as opposed to 128 cores under proportional scaling. We find that the individual bandwidth conservation techniques we evaluate have a wide-ranging impact on core scaling, and that, when combined, these techniques have the potential to enable super-proportional core scaling for up to four technology generations.
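The kind of calculation such a model performs can be illustrated with a minimal sketch. The abstract does not give the model's equations, so everything below is an assumption made for illustration: per-core memory traffic is taken to follow a power law in per-core cache size with exponent `alpha = 0.5`, chip area is measured in units equal to one core, the "balanced" baseline is read as 8 cores plus 8 units of cache, and the area budget doubles every generation. The function names `relative_traffic` and `max_cores` are hypothetical.

```python
def relative_traffic(cores, total_area, base_cores=8, base_cache=8.0, alpha=0.5):
    """Off-chip traffic relative to the baseline chip (1.0 means equal traffic).

    Assumption: per-core traffic scales as (cache per core) ** -alpha,
    and every area unit not used by a core is used for cache.
    """
    cache_per_core = (total_area - cores) / cores
    base_cache_per_core = base_cache / base_cores
    per_core = (cache_per_core / base_cache_per_core) ** (-alpha)
    return (cores / base_cores) * per_core


def max_cores(total_area, **model):
    """Largest core count whose traffic does not exceed the baseline traffic."""
    best = 0
    for cores in range(1, total_area):  # leave at least one unit for cache
        if relative_traffic(cores, total_area, **model) <= 1.0:
            best = cores
    return best


if __name__ == "__main__":
    area = 16  # assumed baseline: 8 cores + 8 units of cache
    for generation in range(5):
        print(generation, area, max_cores(area))
        area *= 2  # assumed: area budget doubles each generation
```

Under these assumed parameters, the sketch gives 8 cores for the 16-unit baseline and 24 cores for the 256-unit chip four generations later, which is consistent with the figures quoted in the abstract; other values of `alpha` or a different baseline split would give different results.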