High Bandwidth On-Chip Cache Design

Authors: Kenneth M. Wilson; Kunle Olukotun
Affiliations: Hewlett Packard Labs, Palo Alto, CA; Stanford Univ., Stanford, CA
Venue: IEEE Transactions on Computers
Year: 2001

Abstract

In this paper, we evaluate high-bandwidth cache organizations that employ multiple cache ports, multiple-cycle hit times, and cache port efficiency enhancements, such as load all and a line buffer, to find the organization that provides the best processor performance. We measure processor performance as the execution time of a dynamic superscalar processor running realistic benchmarks that include operating system references. Compared with a single cache port without enhancements, two cache ports increase processor performance by 25 percent. With a line buffer and load all added to a single-ported cache, the processor achieves 91 percent of the performance of the same processor with a two-ported cache. When the processor is not limited to a single cache port, the results show that a large, dual-ported, multicycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and using a line buffer provide the bandwidth needed by a dynamic superscalar processor. The line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.
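
The two single-port figures in the abstract can be combined with a little arithmetic. The sketch below is an illustration only, not a result from the paper: it assumes both percentages are performance ratios against the same processor, with the plain single-ported cache normalized to 1.0. The function name and dictionary keys are hypothetical.

```python
def relative_performance(dual_port_gain=0.25, enhanced_fraction=0.91):
    """Normalize performance to the plain single-ported cache (= 1.0).

    Assumption (not stated in the abstract): "25 percent" and
    "91 percent" are ratios of performance on the same baseline.
    """
    # Two ports are reported as 25 percent faster than one plain port.
    dual_port = 1.0 + dual_port_gain
    # One port with line buffer and load all reaches 91 percent of that.
    enhanced_single = enhanced_fraction * dual_port
    return {
        "single_port": 1.0,
        "single_port_enhanced": enhanced_single,
        "dual_port": dual_port,
    }


perf = relative_performance()
for name, value in perf.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions, the enhancements alone would recover roughly 14 of the 25 percentage points that a second port provides, which is consistent with the abstract's framing of them as port efficiency improvements.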