Fetching instruction streams

Authors:
Alex Ramirez;Oliverio J. Santana;Josep L. Larriba-Pey;Mateo Valero
Affiliations:
Universitat Politecnica de Catalunya;Universitat Politecnica de Catalunya;Universitat Politecnica de Catalunya;Universitat Politecnica de Catalunya
Venue:
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Year:
2002

Abstract

Fetch performance is a very important factor because it effectively limits the overall processor performance. However, there is little performance advantage in increasing front-end performance beyond what the back-end can consume. For each processor design, the target is to build the best possible fetch engine for the required performance level. A fetch engine will be better if it provides better performance, but also if it takes fewer resources, requires less chip area, or consumes less power.

In this paper we propose a novel fetch architecture based on the execution of long streams of sequential instructions, taking maximum advantage of code layout optimizations. We describe our architecture in detail, and show that it requires less complexity and fewer resources than other high-performance fetch architectures like the trace cache, while providing a high fetch performance suitable for wide-issue superscalar processors.

Our results show that using our fetch architecture together with code layout optimizations obtains 10% higher performance than the EV8 fetch architecture, and 4% higher than the FTB architecture using state-of-the-art branch predictors, while being only 1.5% slower than the trace cache. Even in the absence of code layout optimizations, fetching instruction streams is still 10% faster than the EV8, and only 4% slower than the trace cache.

Fetching instruction streams effectively exploits the special characteristics of layout-optimized codes to provide a high fetch performance, close to that of a trace cache, but at a much lower cost and complexity, similar to that of a basic block architecture.
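To make the notion of a stream of sequential instructions concrete, the following minimal Python sketch splits a dynamic instruction trace into streams. It assumes that a stream runs from the target of one taken branch up to and including the next taken branch; the abstract itself does not spell out this definition. The function name `split_into_streams`, the `(pc, taken)` trace format, and the sample addresses are hypothetical, chosen only for illustration, and do not describe the authors' hardware.

```python
from typing import List, Tuple

# Hypothetical trace format: one (pc, taken) pair per dynamic instruction,
# where taken is True only for a branch that was actually taken.
Trace = List[Tuple[int, bool]]


def split_into_streams(trace: Trace) -> List[Tuple[int, int]]:
    """Split a dynamic trace into (start_pc, length) streams.

    Assumption: a stream ends at a taken branch, and the next stream
    starts at that branch's target (the next instruction in the trace).
    """
    streams: List[Tuple[int, int]] = []
    start = None
    length = 0
    for pc, taken in trace:
        if start is None:
            start = pc          # first instruction of a new stream
        length += 1
        if taken:               # a taken branch closes the stream
            streams.append((start, length))
            start, length = None, 0
    if start is not None:       # trailing instructions with no taken branch
        streams.append((start, length))
    return streams


# Not-taken branches do not break a stream, so a layout that turns taken
# branches into fall-throughs yields fewer, longer streams.
unoptimized = [(0x100, False), (0x104, True), (0x200, False), (0x204, True),
               (0x300, False), (0x304, True)]
optimized = [(0x100, False), (0x104, False), (0x108, False), (0x10C, False),
             (0x110, False), (0x114, True)]

print(split_into_streams(unoptimized))
print(split_into_streams(optimized))
```

In this toy example the same six instructions form three short streams before reordering and a single six-instruction stream afterwards, which is the kind of effect of code layout optimization that the abstract says the architecture takes advantage of.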