Memory bandwidth limitations of future microprocessors

Authors:
Doug Burger; James R. Goodman; Alain Kägi
Affiliations:
Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, Wisconsin (all three authors)
Venue:
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Year:
1996

Quantified Score

Hi-index	0.05

Abstract

This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We find a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and that generated by minimal-traffic caches, implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show that they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
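
The abstract names a decomposition of execution time and an effective pin bandwidth but does not give formulas for either. The following is only a minimal sketch of how such quantities could be computed, assuming three simulated cycle counts (ideal memory, real latency with unlimited bandwidth, and the full system) and assuming that effective pin bandwidth is the raw pin bandwidth divided by the fraction of requested traffic that still goes off chip. All function names, parameter names, and numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StallBreakdown:
    processor: float  # fraction of time spent on useful work
    latency: float    # fraction lost to raw memory latency
    bandwidth: float  # fraction lost to insufficient bandwidth


def decompose(t_perfect: float, t_infinite_bw: float, t_real: float) -> StallBreakdown:
    """Split execution time into three fractions (assumed model).

    t_perfect:     cycles with an ideal memory system
    t_infinite_bw: cycles with real latencies but unlimited bandwidth
    t_real:        cycles on the full system
    """
    if not 0 < t_perfect <= t_infinite_bw <= t_real:
        raise ValueError("expected 0 < t_perfect <= t_infinite_bw <= t_real")
    return StallBreakdown(
        processor=t_perfect / t_real,
        latency=(t_infinite_bw - t_perfect) / t_real,
        bandwidth=(t_real - t_infinite_bw) / t_real,
    )


def effective_pin_bandwidth(pin_bandwidth: float,
                            bytes_requested: float,
                            bytes_off_chip: float) -> float:
    """Scale raw pin bandwidth by how much traffic the caches filter out."""
    if bytes_requested <= 0 or bytes_off_chip <= 0:
        raise ValueError("traffic volumes must be positive")
    traffic_ratio = bytes_off_chip / bytes_requested  # below 1 when caches help
    return pin_bandwidth / traffic_ratio


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only for illustration.
    print(decompose(t_perfect=400, t_infinite_bw=500, t_real=1000))
    # Hypothetical: 800 MB/s pins, caches pass 25% of requested bytes off chip.
    print(effective_pin_bandwidth(800.0, 1000.0, 250.0))
```

Under this assumed model, a cache that sends fewer bytes off chip for the same requested traffic raises the effective pin bandwidth, which is the sense in which the abstract's traffic gap between conventional and minimal-traffic caches translates into potential bandwidth gains.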