A comprehensive instruction fetch mechanism for a processor supporting speculative execution
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Fast and accurate instruction fetch and branch prediction
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tradeoffs in two-level on-chip caching
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Two-level adaptive branch prediction and instruction fetch mechanisms for high performance superscalar processors
Reducing branch costs via branch alignment
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Optimization of instruction fetch mechanisms for high issue rates
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Instruction fetching: coping with code bloat
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Multiple-block ahead branch predictors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Integrating a misprediction recovery cache (MRC) into a superscalar pipeline
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Increasing the instruction fetch rate via block-structured instruction set architectures
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
Target prediction for indirect jumps
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Silicon trends and limits for advanced microprocessors
Communications of the ACM
Improving prediction for procedure returns with return-address-stack repair mechanisms
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Branch Target Buffer Design and Optimization
IEEE Transactions on Computers
A Scaling Scheme and Optimization Methodology for Deep Sub-Micron Interconnect
ICCD '96 Proceedings of the 1996 International Conference on Computer Design, VLSI in Computers and Processors
The Effects of Mispredicted-Path Execution on Branch Prediction Structures
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Limits of Scaling MOSFETs
Fetch directed instruction prefetching
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
The impact of delay on the design of branch predictors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Improving BTB performance in the presence of DLLs
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Architectural and compiler support for effective instruction prefetching: a cooperative approach
ACM Transactions on Computer Systems (TOCS)
Optimizations Enabled by a Decoupled Front-End Architecture
IEEE Transactions on Computers
Latency and energy aware value prediction for high-frequency processors
ICS '02 Proceedings of the 16th international conference on Supercomputing
On Augmenting Trace Cache for High-Bandwidth Value Prediction
IEEE Transactions on Computers
Instruction Level Distributed Processing
HiPC '00 Proceedings of the 7th International Conference on High Performance Computing
Instruction Level Distributed Processing: Adapting to Future Technology
ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
High Performance and Energy Efficient Serial Prefetch Architecture
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Improving the Performance of Heterogeneous DSMs via Multithreading
VECPAR '00 Selected Papers and Invited Talks from the 4th International Conference on Vector and Parallel Processing
Speeding Up Target Address Generation Using a Self-indexed FTB (Research Note)
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Instruction fetch deferral using static slack
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Microarchitecture evaluation with physical planning
Proceedings of the 40th annual Design Automation Conference
A Trace Based Evaluation of Speculative Branch Decoupling
ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
Effective ahead pipelining of instruction block address generation
Proceedings of the 30th annual international symposium on Computer architecture
Prophet/Critic Hybrid Branch Prediction
Proceedings of the 31st annual international symposium on Computer architecture
A low-complexity fetch architecture for high-performance superscalar processors
ACM Transactions on Architecture and Code Optimization (TACO)
Using a serial cache for energy efficient instruction fetching
Journal of Systems Architecture: the EUROMICRO Journal
The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture
IEEE Transactions on Parallel and Distributed Systems
Effective Instruction Prefetching via Fetch Prestaging
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Analysis of the O-GEometric History Length Branch Predictor
Proceedings of the 32nd annual international symposium on Computer Architecture
The instruction register file micro-architecture
Future Generation Computer Systems - Special issue: Parallel computing technologies
Reducing the Latency and Area Cost of Core Swapping through Shared Helper Engines
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Dynamically configurable shared CMP helper engines for improved performance
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Simultaneously improving code size, performance, and energy in embedded processors
Proceedings of the conference on Design, automation and test in Europe: Proceedings
Branch predictor guided instruction decoding
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Block-aware instruction set architecture
ACM Transactions on Architecture and Code Optimization (TACO)
Improving the performance and power efficiency of shared helpers in CMPs
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Evaluating trace cache energy efficiency
ACM Transactions on Architecture and Code Optimization (TACO)
Unified microprocessor core storage
Proceedings of the 4th international conference on Computing frontiers
IEEE Transactions on Computers
A latency-conscious SMT branch prediction architecture
International Journal of High Performance Computing and Networking
The instruction register file micro-architecture
Future Generation Computer Systems - Special issue: Parallel computing technologies
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Low-overhead core swapping for thermal management
PACS'04 Proceedings of the 4th international conference on Power-Aware Computer Systems
Hi-index | 0.01 |
In the pursuit of instruction-level parallelism, significant demands are placed on a processor's instruction delivery mechanism. Delivering the performance necessary to meet future processor execution targets requires that the performance of the instruction delivery mechanism scale with the execution core. Attaining these targets is a challenging task due to I-cache misses, branch mispredictions, and taken branches in the instruction stream. To further complicate matters, a VLSI interconnect scaling trend is materializing that further limits the performance of front-end designs in future generation process technologies.To counter these challenges, we present a fetch architecture that permits a faster cycle time than previous designs and scales better with future process technologies. Our design, called the Fetch Target Buffer, is a multi-level fetch block-oriented predictor. We decouple the FTB from the instruction fetch and decode pipelines to afford it the fastest clock possible. Through cycle-based simulation and circuit-level delay analysis, we find that our multi-level FTB design is capable of delivering instructions 25% faster than the best single-level BTB-based pipeline configuration. Moreover, we show that our design scales better to future process technologies than traditional single-level designs.