Designing a Modern Memory Hierarchy with Hardware Prefetching

Authors:
Wei-Fen Lin;Steven K. Reinhardt;Doug Burger
Affiliations:
Univ. of Michigan, Ann Arbor;Univ. of Michigan, Ann Arbor;Univ. of Texas, Austin, TX
Venue:
IEEE Transactions on Computers
Year:
2001

Citing 28
Cited 9

Line (block) size choice for CPU cache memories

IEEE Transactions on Computers
Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers

IEEE Transactions on Computers
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Using virtual lines to enhance locality exploitation

ICS '94 Proceedings of the 8th international conference on Supercomputing
Evaluating stream buffers as a secondary cache replacement

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The design and verification of the AlphaStation 600 5-series workstation

Digital Technical Journal - Special 10th anniversary issue
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Memory bandwidth limitations of future microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Run-time adaptive cache hierarchy management via reference analysis

Proceedings of the 24th annual international symposium on Computer architecture
Exploiting spatial locality in data caches using spatial footprints

Proceedings of the 25th annual international symposium on Computer architecture
Investigating optimal local memory performance

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A performance comparison of contemporary DRAM architectures

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
The performance impact of block sizes and fetch strategies

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Hardware-only stream prefetching and dynamic access ordering

Proceedings of the 14th international conference on Supercomputing
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
Cache Memories

ACM Computing Surveys (CSUR)
A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Direct Rambus Technology: The New Main Memory Standard

IEEE Micro
Sequential Hardware Prefetching in Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Lockup-free instruction fetch/prefetch cache organization

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Access ordering and memory-conscious cache utilization

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Command Vector Memory Systems: High Performance at Low Cost

PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Pursuing the Performance Potential of Dynamic Cache Line Sizes

ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design
Reducing DRAM Latencies with an Integrated Memory Hierarchy Design

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Filtering Superfluous Prefetches Using Density Vectors

ICCD '01 Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors

Characterization and Evaluation of Cache Hierarchies for Web Servers

World Wide Web
ChipLock: support for secure microarchitectures

ACM SIGARCH Computer Architecture News - Special issue: Workshop on architectural support for security and anti-virus (WASSA)
NPS: a non-interfering deployable web perfectching system

USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Micro-pages: increasing DRAM efficiency with locality-aware data placement

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Handling the problems and opportunities posed by multiple on-chip memory controllers

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Conjugate gradient sparse solvers: performance-power characteristics

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Exploring the prefetcher/memory controller design space: an opportunistic prefetch scheduling strategy

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Rank idle time prediction driven last-level cache writeback

Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness

Quantified Score

Hi-index	14.98

Visualization

Abstract

In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that, even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half its time stalling for L2 misses. Our experimental analysis begins with an effort to tune our baseline memory system aggressively: incorporating optimizations to reduce DRAM row buffer misses, reordering miss accesses to reduce queuing delay, and adjusting the L2 block size to match each channel organization. We show that there is a large gap between the block sizes at which performance is best and at which miss rate is minimized. Using those results, we evaluate a hardware prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 65 percent speedup across 10 of the 26 SPEC2000 benchmarks, without degrading the performance of the others. With eight Rambus channels, these 10 benchmarks improve to within 10 percent of the performance of a perfect L2 cache.