Optimization of instruction fetch mechanisms for high issue rates

Authors:
Thomas M. Conte;Kishore N. Menezes;Patrick M. Mills;Burzin A. Patel
Affiliations:
Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina;Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina;Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina;Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina
Venue:
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Year:
1995

Citing 13
Cited 54

Program optimization for instruction caches

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Achieving high instruction cache performance with an optimizing compiler

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Profile guided code positioning

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The effects of processor architecture on instruction memory traffic

ACM Transactions on Computer Systems (TOCS)
Fast instruction cache performance evaluation using compile-time analysis

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Using profile information to assist classic code optimizations

Software—Practice & Experience
The superblock: an effective technique for VLIW and superscalar compilation

The Journal of Supercomputing - Special issue on instruction-level parallelism
IBM Power and PowerPC

IBM Power and PowerPC
Fast and accurate instruction fetch and branch prediction

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Using branch handling hardware to support profile-driven optimization

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Two-level adaptive branch prediction and instruction fetch mechanisms for high performance superscalar processors

Two-level adaptive branch prediction and instruction fetch mechanisms for high performance superscalar processors
Instruction scheduling for the Motorola 88110

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Architecture of the Pentium Microprocessor

IEEE Micro

High-bandwidth address translation for multiple-issue processors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The case for a single-chip multiprocessor

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Multiple-block ahead branch predictors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Increasing the instruction fetch rate via block-structured instruction set architectures

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences

Proceedings of the 24th annual international symposium on Computer architecture
Path-based next trace prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving the accuracy and performance of memory communication through renaming

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The effect of instruction fetch bandwidth on value prediction

Proceedings of the 25th annual international symposium on Computer architecture
Modeled and Measured Instruction Fetching Performance for Superscalar Microprocessors

IEEE Transactions on Parallel and Distributed Systems
Predictive techniques for aggressive load speculation

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
A Trace Cache Microarchitecture and Evaluation

IEEE Transactions on Computers - Special issue on cache memory and related problems
Evaluation of Design Options for the Trace Cache Fetch Mechanism

IEEE Transactions on Computers - Special issue on cache memory and related problems
MPS: Miss-Path Scheduling for Multiple-Issue Processors

IEEE Transactions on Computers
A dynamic scheduling logic for exploiting multiple functional units in single chip multithreaded architectures

Proceedings of the 1999 ACM symposium on Applied computing
Selective value prediction

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
The block-based trace cache

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A scalable front-end architecture for fast instruction delivery

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Adding a vector unit to a superscalar processor

ICS '99 Proceedings of the 13th international conference on Supercomputing
Software trace cache

ICS '99 Proceedings of the 13th international conference on Supercomputing
Reducing cache misses using hardware and software page placement

ICS '99 Proceedings of the 13th international conference on Supercomputing
Classifying load and store instructions for memory renaming

ICS '99 Proceedings of the 13th international conference on Supercomputing
Exploiting a new level of DLP in multimedia applications

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Memory Renaming: Fast, Early and Accurate Processing of Memory Communication

International Journal of Parallel Programming
Completion time multiple branch prediction for enhancing trace cache performance

Proceedings of the 27th annual international symposium on Computer architecture
A hardware mechanism for dynamic extraction and relayout of program hot spots

Proceedings of the 27th annual international symposium on Computer architecture
Optimizations Enabled by a Decoupled Front-End Architecture

IEEE Transactions on Computers
A cost effective architecture for vectorizable numerical and multimedia applications

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Two cache lines prediction for a wide-issue micro-architecture

ACSAC '01 Proceedings of the 6th Australasian conference on Computer systems architecture
Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures

International Journal of Parallel Programming
An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors

International Journal of Parallel Programming
Software Trace Cache for Commercial Applications

International Journal of Parallel Programming
Superspeculative Microarchitecture for Beyond AD 2000

Computer
Dynamically Reconfiguring Processor Resources to Reduce Power Consumption in High-Performance Processors

PACS '00 Proceedings of the First International Workshop on Power-Aware Computer Systems-Revised Papers
On the Performance of Fetch Engines Running DSS Workloads

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Fetching instruction streams

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Selecting long atomic traces for high coverage

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Parallelism in the front-end

Proceedings of the 30th annual international symposium on Computer architecture
Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
A low-complexity fetch architecture for high-performance superscalar processors

ACM Transactions on Architecture and Code Optimization (TACO)
Software Trace Cache

IEEE Transactions on Computers
The CSI multimedia architecture

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A cost-effective implementation of an ECC-protected instruction queue for out-of-order microprocessors

Proceedings of the 43rd annual Design Automation Conference
Wide and efficient trace prediction using the local trace predictor

Proceedings of the 20th annual international conference on Supercomputing
Evaluating trace cache energy efficiency

ACM Transactions on Architecture and Code Optimization (TACO)
An enhanced DLX-based superscalar system simulator

WCAE-3 '97 Proceedings of the 1997 workshop on Computer architecture education
CATCH: a mechanism for dynamically detecting Cache-Content-Duplication and its application to instruction caches

Proceedings of the conference on Design, automation and test in Europe
The Design and Evaluation of a Selective Way Based Trace Cache

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Trace Cache Miss Rate

International Journal of Modelling and Simulation
Code alignment for architectures with pipeline group dispatching

Proceedings of the 3rd Annual Haifa Experimental Systems Conference
CATCH: A mechanism for dynamically detecting cache-content-duplication in instruction caches

ACM Transactions on Architecture and Code Optimization (TACO)
Runtime adaptation: a case for reactive code alignment

Proceedings of the 2nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Reducing instruction fetch energy in multi-issue processors

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.01

Visualization

Abstract

Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the responsibility of the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for the efficient operation of the fetch unit. Several studies on cache design and branch prediction address this problem. However, these techniques are not sufficient. Even in the presence of efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign these in the proper order, and supply them to the decoder. This paper explores solutions to this problem and presents several schemes with varying degrees of performance and cost. The most-general scheme, the collapsing buffer, achieves near-perfect performance and consistently aligns instructions in excess of 90% of the time, over a wide range of issue rates. The performance boost provided by compiler optimization techniques is also investigated. Results show that compiler optimization can significantly enhance performance across all schemes. The collapsing buffer supplemented by compiler techniques remains the best-performing mechanism. The paper closes with recommendations and suggestions for future.