A scalable front-end architecture for fast instruction delivery

Authors:
Glenn Reinman;Todd Austin;Brad Calder
Affiliations:
Department of Computer Science and Engineering, University of California, San Diego;Microcomputer Research Labs, Intel Corporation;Department of Computer Science and Engineering, University of California, San Diego
Venue:
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Year:
1999


Abstract

In the pursuit of instruction-level parallelism, significant demands are placed on a processor's instruction delivery mechanism. Delivering the performance necessary to meet future processor execution targets requires that the performance of the instruction delivery mechanism scale with the execution core. Attaining these targets is a challenging task due to I-cache misses, branch mispredictions, and taken branches in the instruction stream. To further complicate matters, a VLSI interconnect scaling trend is materializing that further limits the performance of front-end designs in future generation process technologies.

To counter these challenges, we present a fetch architecture that permits a faster cycle time than previous designs and scales better with future process technologies. Our design, called the Fetch Target Buffer, is a multi-level fetch block-oriented predictor. We decouple the FTB from the instruction fetch and decode pipelines to afford it the fastest clock possible. Through cycle-based simulation and circuit-level delay analysis, we find that our multi-level FTB design is capable of delivering instructions 25% faster than the best single-level BTB-based pipeline configuration. Moreover, we show that our design scales better to future process technologies than traditional single-level designs.