On high-bandwidth data cache design for multi-issue processors

Authors:
Jude A. Rivers;Gary S. Tyson;Edward S. Davidson;Todd M. Austin
Affiliations:
Advanced Computer Architecture Laboratory, The University of Michigan;Advanced Computer Architecture Laboratory, The University of Michigan;Advanced Computer Architecture Laboratory, The University of Michigan;Microcomputer Research Labs, Intel Corporation
Venue:
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Year:
1997

Citing 14
Cited 34

On the effective bandwidth of interleaved memories in vector processor systems

IEEE Transactions on Computers
Hardware support for large atomic units in dynamically scheduled machines

MICRO 21 Proceedings of the 21st annual workshop on Microprogramming and microarchitecture
Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers

IEEE Transactions on Computers
High-bandwidth data memory systems for superscalar processors

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Pseudo-randomly interleaved memory

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Distributed storage control unit for the Hitachi S-3800 multivector supercomputer

ICS '94 Proceedings of the 8th international conference on Supercomputing
A fill-unit approach to multiple instruction issue

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Increasing cache port efficiency for dynamic superscalar microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
High-bandwidth address translation for multiple-issue processors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Data caches for superscalar processors

ICS '97 Proceedings of the 11th international conference on Supercomputing
One Billion Transistors, One Uniprocessor, One Chip

Computer
Advanced performance features of the 64-bit PA-8000

COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
Decoupled access/execute computer architectures

ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture

Evaluation of high performance multicache parallel texture mapping

ICS '98 Proceedings of the 12th international conference on Supercomputing
Speculation techniques for improving load related instruction scheduling

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Decoupling local variable accesses in a wide-issue superscalar processor

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Adding a vector unit to a superscalar processor

ICS '99 Proceedings of the 13th international conference on Supercomputing
Instruction fetch mechanisms for multipath execution processors

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Access region locality for high-bandwidth processor memory system design

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
A cost effective architecture for vectorizable numerical and multimedia applications

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
A High-Bandwidth Memory Pipeline for Wide Issue Processors

IEEE Transactions on Computers
Execution history guided instruction prefetching

ICS '02 Proceedings of the 16th international conference on Supercomputing
Speculative dynamic vectorization

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Reducing the complexity of the register file in dynamic superscalar processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Execution History Guided Instruction Prefetching

The Journal of Supercomputing
Control-Flow Independence Reuse via Dynamic Vectorization

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Scalable cache memory design for large-scale SMT architectures

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
On the effectiveness of prefetching and reuse in reducing L1 data cache traffic: a case study of Snort

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
The TM3270 Media-Processor Data Cache

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
The TM3270 Media-Processor

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Speculative execution for hiding memory latency

MEDEA '04 Proceedings of the 2004 workshop on MEmory performance: DEaling with Applications , systems and architecture
Exploiting the replication cache to improve performance for multiple-issue microprocessors

ACM SIGARCH Computer Architecture News - Special issue: MEDEA 2004 workshop
Exploiting the replication cache to improve cache read bandwidth cost effectively

MEDEA '05 Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications , systems and architecture
Investigating cache energy and latency break-even points in high performance processors

MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures
I-cache multi-banking and vertical interleaving

Proceedings of the 17th ACM Great Lakes symposium on VLSI
Unified microprocessor core storage

Proceedings of the 4th international conference on Computing frontiers
Investigating cache energy and latency break-even points in high performance processors

ACM SIGARCH Computer Architecture News
Parallel Memory Architecture for Application-Specific Instruction-Set Processors

Journal of Signal Processing Systems
Access region cache with register guided memory reference partitioning

Journal of Systems Architecture: the EUROMICRO Journal
Parallel memory architecture for TTA processor

SAMOS'07 Proceedings of the 7th international conference on Embedded computer systems: architectures, modeling, and simulation
The bandwidth expansion effectiveness of cache levels block prefetch

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors

Journal of Signal Processing Systems
An instruction-scheduling-aware data partitioning technique for coarse-grained reconfigurable architectures

Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Dynamic partition of memory reference instructions – a register guided approach

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Hot-and-Cold: using criticality in the design of energy-efficient caches

PACS'03 Proceedings of the Third international conference on Power - Aware Computer Systems
APC: a performance metric of memory systems

ACM SIGMETRICS Performance Evaluation Review
Virtually split cache: An efficient mechanism to distribute instructions and data

ACM Transactions on Architecture and Code Optimization (TACO)

Abstract

Highly aggressive multi-issue processor designs of the past few years and projections for the decade require that we redesign the operation of the cache memory system. The number of instructions that must be processed (including incorrectly predicted ones) will approach 16 or more per cycle. Since memory operations account for about a third of all instructions executed, these systems will have to support multiple data references per cycle. In this paper, we explore reference stream characteristics to determine how best to meet the need for ever-increasing access rates. We identify limitations of existing multi-ported cache designs and propose a new structure, the Locality-Based Interleaved Cache (LBIC), to exploit the characteristics of the data reference stream while approaching the economy of traditional multi-bank cache design. Experimental results show that the LBIC structure is capable of outperforming current multi-ported approaches.
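As a rough illustration of the bandwidth demand described in the abstract, the sketch below computes the data references per cycle implied by a 16-wide machine with about one third memory operations. It then models a traditional line-interleaved multi-bank cache with one port per bank. This is not the LBIC design, whose details are not given here. The line size (32 bytes), bank count (4), and all function names are assumptions chosen only for illustration.

```python
def refs_per_cycle(issue_width, mem_fraction=1 / 3):
    """Average data references per cycle implied by the issue width."""
    return issue_width * mem_fraction


def bank_of(addr, line_bytes=32, num_banks=4):
    """Bank index under line interleaving (hypothetical parameters)."""
    return (addr // line_bytes) % num_banks


def served_this_cycle(addrs, line_bytes=32, num_banks=4):
    """Return the references a one-port-per-bank cache can serve in one cycle.

    References are considered in order; a reference is served only if no
    earlier reference in the same cycle already claimed its bank.
    """
    busy = set()
    served = []
    for addr in addrs:
        bank = bank_of(addr, line_bytes, num_banks)
        if bank not in busy:
            busy.add(bank)
            served.append(addr)
    return served


if __name__ == "__main__":
    print(round(refs_per_cycle(16), 2))             # about 5.33 references/cycle
    print(served_this_cycle([0, 8, 32, 64, 128]))   # [0, 32, 64]
```

In this simple model, addresses 0 and 8 fall in the same cache line yet still conflict, because the bank accepts only one access per cycle. Address 128 conflicts with address 0 because both map to bank 0.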