Designing high bandwidth on-chip caches

Authors:
Kenneth M. Wilson;Kunle Olukotun
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA;Computer Systems Laboratory, Stanford University, Stanford, CA
Venue:
Proceedings of the 24th annual international symposium on Computer architecture
Year:
1997

References: 34
Cited by: 14

Performance tradeoffs in cache design

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
High-bandwidth data memory systems for superscalar processors

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Performance optimization of pipelined primary cache

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Reducing memory latency via non-blocking and prefetching caches

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Tradeoffs in processor/memory interfaces for superscalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Cache write policies and performance

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Instruction-level parallel processing: history, overview, and perspective

The Journal of Supercomputing - Special issue on instruction-level parallelism
Tradeoffs in two-level on-chip caching

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Characterization of alpha AXP performance using TP and SPEC workloads

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Complexity/performance tradeoffs with non-blocking loads

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A study of single-chip processor/cache organizations for large numbers of transistors

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A unified architectural tradeoff methodology

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Resource allocation in a high clock rate microprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Circuit implementation of a 300-MHz 64-bit second-generation CMOS Alpha CPU

Digital Technical Journal - Special 10th anniversary issue
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor

Digital Technical Journal - Special 10th anniversary issue
Hitting the memory wall: implications of the obvious

ACM SIGARCH Computer Architecture News
The memory wall and the CMOS end-point

ACM SIGARCH Computer Architecture News
The impact of architectural trends on operating system performance

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Missing the memory wall: the case for processor/memory integration

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Increasing cache port efficiency for dynamic superscalar microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Embra: fast and flexible machine simulation

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Benchmark Handbook: For Database and Transaction Processing Systems

Benchmark Handbook: For Database and Transaction Processing Systems
Complete Computer System Simulation: The SimOS Approach

IEEE Parallel & Distributed Technology: Systems & Technology
Performance Features of the PA7100 Microprocessor

IEEE Micro
The Alpha AXP Architecture and 21064 Processor

IEEE Micro
Cache Performance of the SPEC92 Benchmark Suite

IEEE Micro
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
Advanced performance features of the 64-bit PA-8000

COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
Lockup-free instruction fetch/prefetch cache organization

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Performance Factors for Superscalar Processors

Performance Factors for Superscalar Processors
High Performance Cache Architectures to Support Dynamic Superscalar Microprocessors

High Performance Cache Architectures to Support Dynamic Superscalar Microprocessors

Exploiting dead value information

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Decoupling local variable accesses in a wide-issue superscalar processor

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques

IEEE Transactions on Computers
Hardware spatial forwarding for widely shared data

Proceedings of the 14th international conference on Supercomputing
Design and Evaluation of a Switch Cache Architecture for CC-NUMA Multiprocessors

IEEE Transactions on Computers
High Bandwidth On-Chip Cache Design

IEEE Transactions on Computers
Cache decay: exploiting generational behavior to reduce cache leakage power

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Let caches decay: reducing leakage energy via exploitation of cache generational behavior

ACM Transactions on Computer Systems (TOCS)
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
On cache memory hierarchy for Chip-Multiprocessor

ACM SIGARCH Computer Architecture News
Design and Optimization of Large Size and Low Overhead Off-Chip Caches

IEEE Transactions on Computers
A case for shared instruction cache on chip multiprocessors running OLTP

MEDEA '03 Proceedings of the 2003 workshop on MEmory performance: DEaling with Applications , systems and architecture
Scalable cache memory design for large-scale SMT architectures

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Hierarchical memory system design for a heterogeneous multi-core processor

Proceedings of the 2008 ACM symposium on Applied computing

Abstract

In this paper, we evaluate the performance of high-bandwidth caches that employ multiple ports, multiple-cycle hit times, on-chip DRAM, and a line buffer to find the organization that provides the best processor performance. Processor performance is measured as execution time on a dynamic superscalar processor running realistic benchmarks that include operating system references. The results show that a large, dual-ported, multi-cycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and adding a line buffer provide the bandwidth needed by a dynamic superscalar processor. In addition, the line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.
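
The abstract's claim that a line buffer both adds port bandwidth and hides latency can be illustrated with a toy model. The sketch below is not the authors' simulator. It assumes a small, fully associative LRU buffer of recently loaded lines, a 32-byte line, a one-cycle buffer hit, and that every load hits in the cache; the names `LineBuffer`, `simulate`, and `LINE_BYTES` are hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 32  # assumed line size, not taken from the paper


class LineBuffer:
    """Tiny fully associative LRU buffer of recently loaded cache lines."""

    def __init__(self, entries):
        self.entries = entries
        self.lines = OrderedDict()  # line address -> None, oldest first

    def access(self, addr):
        """Return True on a line-buffer hit; on a miss, install the line."""
        line = addr // LINE_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)  # refresh LRU position
            return True
        self.lines[line] = None
        if len(self.lines) > self.entries:
            self.lines.popitem(last=False)  # evict least recently used
        return False


def simulate(addresses, buffer_entries, cache_hit_cycles):
    """Return (total load latency in cycles, number of cache port accesses).

    Assumption: every load hits in the cache, so only hit latency is modeled.
    """
    buf = LineBuffer(buffer_entries)
    latency = 0
    port_accesses = 0
    for addr in addresses:
        if buffer_entries > 0 and buf.access(addr):
            latency += 1  # served by the line buffer, no cache port used
        else:
            latency += cache_hit_cycles  # goes to the pipelined cache
            port_accesses += 1
    return latency, port_accesses


loads = [0, 8, 16, 24, 32, 40, 0, 8]
print(simulate(loads, buffer_entries=0, cache_hit_cycles=2))  # no line buffer
print(simulate(loads, buffer_entries=2, cache_hit_cycles=2))  # two-entry buffer
```

In this toy trace, loads with spatial locality are served from the buffer, so they neither occupy a cache port nor pay the multi-cycle hit time. The sizes and latencies are illustrative only and say nothing about the paper's measured results.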