Design and Optimization of Large Size and Low Overhead Off-Chip Caches

Authors:
Zhao Zhang;Zhichun Zhu;Xiaodong Zhang
Affiliations:
Dept. of Electr. & Comput. Eng., Iowa State Univ., Ames, IA, USA;-;-
Venue:
IEEE Transactions on Computers
Year:
2004

Citing 37
Cited 5

On the inclusion properties for multi-level cache hierarchies

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Alternative implementations of two-level adaptive branch prediction

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Performance of cached DRAM organizations in vector supercomputers

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Evaluating stream buffers as a secondary cache replacement

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Decoupled sectored caches: conciliating low tag implementation cost

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A modified approach to data cache management

Proceedings of the 28th annual international symposium on Microarchitecture
Missing the memory wall: the case for processor/memory integration

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
DCD—disk caching disk: a new approach for boosting I/O performance

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The design and analysis of a cache architecture for texture mapping

Proceedings of the 24th annual international symposium on Computer architecture
Designing high bandwidth on-chip caches

Proceedings of the 24th annual international symposium on Computer architecture
Multi-level texture caching for 3D graphics hardware

Proceedings of the 25th annual international symposium on Computer architecture
Functional Implementation Techniques for CPU Cache Memories

IEEE Transactions on Computers - Special issue on cache memory and related problems
Speculation techniques for improving load related instruction scheduling

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A performance comparison of contemporary DRAM architectures

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Eager writeback - a technique for improving bandwidth utilization

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Dynamically allocating processor resources between nearby and distant ILP

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Data prefetching by dependence graph precomputation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
The architecture of the DIVA processing-in-memory chip

ICS '02 Proceedings of the 16th international conference on Supercomputing
Bloom filtering cache misses for accurate data speculation and prefetching

ICS '02 Proceedings of the 16th international conference on Supercomputing
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
The Cache DRAM Architecture: A DRAM with an On-Chip Cache Memory

IEEE Micro
A Case for Intelligent RAM

IEEE Micro
The IA-64 Itanium Processor Cartridge

IEEE Micro
Pinnacle: IBM MXT in a Memory Controller Chip

IEEE Micro
Cached DRAM for ILP Processor Memory Access Latency Reduction

IEEE Micro
A Decoupled Predictor-Directed Stream Prefetching Architecture

IEEE Transactions on Computers
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A study of instruction cache organizations and replacement policies

ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Active Memory: Micron's Yukon

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Performance of Hardware Compressed Main Memory

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Reducing DRAM Latencies with an Integrated Memory Hierarchy Design

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Fine-grain Priority Scheduling on Multi-channel Memory Systems

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
A Single Chip Multiprocessor Integrated with High Density DRAM

A Single Chip Multiprocessor Integrated with High Density DRAM
A Media-Enhanced Vector Architecture for Embedded Memory Systems

A Media-Enhanced Vector Architecture for Embedded Memory Systems

A freespace crossbar for multi-core processors

Proceedings of the 22nd annual international conference on Supercomputing
FILESPPA: Fast Instruction Level Embedded System Power and Performance Analyzer

Microprocessors & Microsystems
Efficient memory management of a hierarchical and a hybrid main memory for MN-MATE platform

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	14.98

Visualization

Abstract

Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches are facing two issues as those applications require increasingly large caches. First, an SRAM cache has a limited size due to the low density and high cost of SRAM and, thus, cannot hold the working sets of many memory-intensive applications. Second, since the tag checking overhead of large caches is nontrivial, the existence of L3 caches increases the cache miss penalty and may even harm the performance of some memory-intensive applications. To address these two issues, we present a new memory hierarchy design that uses cached DRAM to construct a large size and low overhead off-chip cache. The high density DRAM portion in the cached DRAM can hold large working sets, while the small SRAM portion exploits the spatial locality appearing in L2 miss streams to reduce the access latency. The L3 tag array is placed off-chip with the data array, minimizing the area overhead on the processor for L3 cache, while a small tag cache is placed on-chip, effectively removing the off-chip tag access overhead. A prediction technique accurately predicts the hit/miss status of an access to the cached DRAM, further reducing the access latency. Conducting execution-driven simulations for a 2GHz 4-way issue processor and with 11 memory-intensive programs from the SPEC 2000 benchmark, we show that a system with a cached DRAM of 64MB DRAM and 128KB on-chip SRAM cache as the off-chip cache outperforms the same system with an 8MB SRAM L3 off-chip cache by up to 78 percent measured by the total execution time. The average speedup of the system with the cached-DRAM off-chip cache is 25 percent over the system with the L3 SRAM cache.