Trace cache: a low latency approach to high bandwidth instruction fetching

Authors:
Eric Rotenberg; Steve Bennett; James E. Smith
Affiliations:
Computer Science Dept., Univ. of Wisconsin - Madison; Intel Corporation; Dept. of Elec. and Comp. Engr., Univ. of Wisconsin - Madison
Venue:
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Year:
1996

Citing 15
Cited 149

Hardware support for large atomic units in dynamically scheduled machines

MICRO 21 Proceedings of the 21st annual workshop on Microprogramming and microarchitecture
Machine organization of the IBM RISC System/6000 processor

IBM Journal of Research and Development
Branch history table prediction of moving target branches due to subroutine returns

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Improving the accuracy of dynamic branch prediction using branch correlation

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
A comprehensive instruction fetch mechanism for a processor supporting speculative execution

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Increasing the instruction fetch rate via multiple branch prediction and a branch address cache

ICS '93 Proceedings of the 7th international conference on Supercomputing
A fill-unit approach to multiple instruction issue

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Two-level adaptive branch prediction and instruction fetch mechanisms for high performance superscalar processors

Two-level adaptive branch prediction and instruction fetch mechanisms for high performance superscalar processors
Optimization of instruction fetch mechanisms for high issue rates

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Instruction fetching: coping with code bloat

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Control flow prediction with tree-like subgraphs for superscalar processors

Proceedings of the 28th annual international symposium on Microarchitecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Efficient program tracing

Computer
A study of branch prediction strategies

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture

Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences

Proceedings of the 24th annual international symposium on Computer architecture
Exploiting instruction level parallelism in processors by caching scheduled groups

Proceedings of the 24th annual international symposium on Computer architecture
DAISY: dynamic compilation for 100% architectural compatibility

Proceedings of the 24th annual international symposium on Computer architecture
Path-based next trace prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Alternative fetch and issue policies for the trace cache fetch mechanism

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
On high-bandwidth data cache design for multi-issue processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Speculative multithreaded processors

ICS '98 Proceedings of the 12th international conference on Supercomputing
Hardware and software support for speculative execution of sequential binaries on a chip-multiprocessor

ICS '98 Proceedings of the 12th international conference on Supercomputing
The effect of instruction fetch bandwidth on value prediction

Proceedings of the 25th annual international symposium on Computer architecture
Improving trace cache effectiveness with branch promotion and trace packing

Proceedings of the 25th annual international symposium on Computer architecture
Predictive techniques for aggressive load speculation

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Load latency tolerance in dynamically scheduled processors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
An empirical study of decentralized ILP execution models

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Functional Implementation Techniques for CPU Cache Memories

IEEE Transactions on Computers - Special issue on cache memory and related problems
A Trace Cache Microarchitecture and Evaluation

IEEE Transactions on Computers - Special issue on cache memory and related problems
Evaluation of Design Options for the Trace Cache Fetch Mechanism

IEEE Transactions on Computers - Special issue on cache memory and related problems
MPS: Miss-Path Scheduling for Multiple-Issue Processors

IEEE Transactions on Computers
Selective value prediction

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Decoupling local variable accesses in a wide-issue superscalar processor

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
The block-based trace cache

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A scalable front-end architecture for fast instruction delivery

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Whole program paths

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Control Flow Prediction Schemes for Wide-Issue Superscalar Processors

IEEE Transactions on Parallel and Distributed Systems
Adding a vector unit to a superscalar processor

ICS '99 Proceedings of the 13th international conference on Supercomputing
Software trace cache

ICS '99 Proceedings of the 13th international conference on Supercomputing
Clustered speculative multithreaded processors

ICS '99 Proceedings of the 13th international conference on Supercomputing
Classifying load and store instructions for memory renaming

ICS '99 Proceedings of the 13th international conference on Supercomputing
A comparison of scalable superscalar processors

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
A Chip-Multiprocessor Architecture with Speculative Multithreading

IEEE Transactions on Computers
Control independence in trace processors

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Access region locality for high-bandwidth processor memory system design

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques

IEEE Transactions on Computers
Aggressive Dynamic Execution of Decoded Traces

Journal of VLSI Signal Processing Systems - Special issue on the 1997 IEEE workshop on signal processing systems (SiPS): design and implementation
Transient fault detection via simultaneous multithreading

Proceedings of the 27th annual international symposium on Computer architecture
Trace preconstruction

Proceedings of the 27th annual international symposium on Computer architecture
Completion time multiple branch prediction for enhancing trace cache performance

Proceedings of the 27th annual international symposium on Computer architecture
A hardware mechanism for dynamic extraction and relayout of program hot spots

Proceedings of the 27th annual international symposium on Computer architecture
Instruction path coprocessors

Proceedings of the 27th annual international symposium on Computer architecture
Early load address resolution via register tracking

Proceedings of the 27th annual international symposium on Computer architecture
Dynamo: a transparent dynamic optimization system

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Software profiling for hot path prediction: less is more

ACM SIGPLAN Notices
Hardware support for dynamic activation of compiler-directed computation reuse

ACM SIGPLAN Notices
The impact of delay on the design of branch predictors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
PipeRench implementation of the instruction path coprocessor

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Increasing the size of atomic instruction blocks using control flow assertions

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Inherently Lower-Power High-Performance Superscalar Architectures

IEEE Transactions on Computers
Optimizations Enabled by a Decoupled Front-End Architecture

IEEE Transactions on Computers
A time-stamping algorithm for efficient performance estimation of superscalar processors

Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A cost effective architecture for vectorizable numerical and multimedia applications

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Software profiling for hot path prediction: less is more

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Hardware support for dynamic activation of compiler-directed computation reuse

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Micro-operation cache: a power aware frontend for the variable instruction length ISA

ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
A High-Bandwidth Memory Pipeline for Wide Issue Processors

IEEE Transactions on Computers
Boosting trace cache performance with nonhead miss speculation

ICS '02 Proceedings of the 16th international conference on Supercomputing
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures

International Journal of Parallel Programming
The Need for Fast Communication in Hardware-Based Speculative Chip Multiprocessors

International Journal of Parallel Programming
An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors

International Journal of Parallel Programming
Software Trace Cache for Commercial Applications

International Journal of Parallel Programming
One Billion Transistors, One Uniprocessor, One Chip

Computer
Superspeculative Microarchitecture for Beyond AD 2000

Computer
Trace Processors: Moving to Fourth-Generation Microarchitectures

Computer
Calibration of Microprocessor Performance Models

Computer
Using Paths to Measure, Explain, and Enhance Program Behavior

Computer
Guest Editors' Introduction: Early 21st Century Processors

Computer
On Augmenting Trace Cache for High-Bandwidth Value Prediction

IEEE Transactions on Computers
Multiscalar Execution along a Single Flow of Control

ICPP '97 Proceedings of the international Conference on Parallel Processing
Hierarchical Interconnects for On-Chip Clustering

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
The Case for Speculative Multithreading on SMT Processors

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Speculative Clustered Caches for Clustered Processors

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
On the Performance of Fetch Engines Running DSS Workloads

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
A Comparative Study of Redundancy in Trace Caches (Research Note)

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Secure Execution via Program Shepherding

Proceedings of the 11th USENIX Security Symposium
Performance Evaluation of Exception Handling in I/O Libraries

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
DELI: a new run-time control point

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Fetching instruction streams

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Dynamic trace selection using performance monitoring hardware sampling

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
An infrastructure for adaptive dynamic optimization

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Selecting long atomic traces for high coverage

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Enhancing memory level parallelism via recovery-free value prediction

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
The Ultrascalar Processor: An Asymptotically Scalable Superscalar Microarchitecture

ARVLSI '99 Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI
Catching Accurate Profiles in Hardware

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Design of Instruction Stream Buffer with Trace Support for X86 Processors

ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
Dynamic native optimization of interpreters

Proceedings of the 2003 workshop on Interpreters, virtual machines and emulators
Parallelism in the front-end

Proceedings of the 30th annual international symposium on Computer architecture
Effective ahead pipelining of instruction block address generation

Proceedings of the 30th annual international symposium on Computer architecture
Improving dynamic cluster assignment for clustered trace cache processors

Proceedings of the 30th annual international symposium on Computer architecture
Aggressive Dynamic Execution of Multimedia Kernel Traces

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
A Clustered Approach to Multithreaded Processors

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
A trace-level value predictor for Contrail processors

ACM SIGARCH Computer Architecture News
Balancing Reuse Opportunities and Performance Gains with Subblock Value Reuse

IEEE Transactions on Computers
Hardware Support for Control Transfers in Code Caches

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Micro-operation cache: a power aware frontend for variable instruction length ISA

IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on low power
Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism

IEEE Transactions on Computers
BLOB computing

Proceedings of the 1st conference on Computing frontiers
A low-complexity fetch architecture for high-performance superscalar processors

ACM Transactions on Architecture and Code Optimization (TACO)
Decode filter cache for energy efficient instruction cache hierarchy in super scalar architectures

Proceedings of the 2004 Asia and South Pacific Design Automation Conference
Cluster miss prediction with prefetch on miss for embedded CPU instruction caches

Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
Software Trace Cache

IEEE Transactions on Computers
A Programmable Hardware Path Profiler

Proceedings of the international symposium on Code generation and optimization
Code placement for improving dynamic branch prediction accuracy

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Improving trace cache hit rates using the sliding window fill mechanism and fill select table

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Enhancing Memory-Level Parallelism via Recovery-Free Value Prediction

IEEE Transactions on Computers
Energy-aware fetch mechanism: trace cache and BTB customization

ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
On the performance of trace locality of reference

Performance Evaluation - Performance modelling and evaluation of high-performance parallel and distributed systems
The instruction register file micro-architecture

Future Generation Computer Systems - Special issue: Parallel computing technologies
Trace Cache Sampling Filter

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors

IEEE Transactions on Parallel and Distributed Systems
Branch predictor guided instruction decoding

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Block-aware instruction set architecture

ACM Transactions on Architecture and Code Optimization (TACO)
Improving instruction cache performance in OLTP

ACM Transactions on Database Systems (TODS)
A case study of multi-threading in the embedded space

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Wide and efficient trace prediction using the local trace predictor

Proceedings of the 20th annual international conference on Supercomputing
Evaluating trace cache energy efficiency

ACM Transactions on Architecture and Code Optimization (TACO)
Trace cache sampling filter

ACM Transactions on Computer Systems (TOCS)
A predictive decode filter cache for reducing power consumption in embedded processors

ACM Transactions on Design Automation of Electronic Systems (TODAES)
On the power of simple branch prediction analysis

ASIACCS '07 Proceedings of the 2nd ACM symposium on Information, computer and communications security
A latency-conscious SMT branch prediction architecture

International Journal of High Performance Computing and Networking
Secretly monopolizing the CPU without superuser privileges

SS'07 Proceedings of the 16th USENIX Security Symposium
Remote detection of virtual machine monitors with fuzzy benchmarking

ACM SIGOPS Operating Systems Review
Temporal instruction fetch streaming

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
The Design and Evaluation of a Selective Way Based Trace Cache

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
TAO: two-level atomicity for dynamic binary optimizations

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Scalable multi-cores with improved per-core performance using off-the-critical path reconfigurable hardware

HiPC'08 Proceedings of the 15th international conference on High performance computing
Reusing cached schedules in an out-of-order processor with in-order issue logic

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Dynamic branch prediction and control speculation

International Journal of High Performance Systems Architecture
An Adaptive Data Prefetcher for High-Performance Processors

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Towards minimizing execution delays on dynamically reconfigurable processors: a case study on REDEFINE

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
An on-chip instruction cache design with one-bit tag for low-power embedded systems

Microprocessors & Microsystems
Reducing memory space consumption through dataflow analysis

Computer Languages, Systems and Structures
Do trace cache, value prediction and prefetching improve SMT throughput?

ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
RIMP: runtime implicit predication

APPT'05 Proceedings of the 6th international conference on Advanced Parallel Processing Technologies
Trace-Based runtime instruction rescheduling for architecture extension

ICESS'05 Proceedings of the Second international conference on Embedded Software and Systems
Energy-Effective instruction fetch unit for wide issue processors

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
MLP-Aware instruction queue resizing: the key to power-efficient performance

ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
Exploiting inactive rename slots for detecting soft errors

ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
Trace execution automata in dynamic binary translation

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Algorithm-level Feedback-controlled Adaptive data prefetcher: Accelerating data access for high-performance processors

Parallel Computing
Adaptive loop caching using lightweight runtime control flow analysis

ACM Transactions on Embedded Computing Systems (TECS) - Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems
Towards a multiple-ISA embedded system

Journal of Systems Architecture: the EUROMICRO Journal
ASC: automatically scalable computation

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Reducing instruction fetch energy in multi-issue processors

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.04

Abstract

As the issue width of superscalar processors is increased, instruction fetch bandwidth requirements will also increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. We propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. For the Instruction Benchmark Suite (IBS) and SPEC92 integer benchmarks, a 4 kilobyte trace cache improves performance on average by 28% over conventional sequential fetching. Further, it is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.
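
The idea in the abstract, storing instructions in the order they executed so that a later fetch can deliver several basic blocks at once, can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the authors' design: the names (`Trace`, `TraceCache`, `build_trace`), the line limits of 16 instructions and 3 basic blocks, the unbounded dictionary used as storage, and the rule that a hit needs both a matching start address and matching branch predictions are all simplifications chosen here.

```python
from dataclasses import dataclass

MAX_INSTRUCTIONS = 16  # assumed capacity of one trace line
MAX_BLOCKS = 3         # assumed limit on basic blocks per trace


@dataclass
class Trace:
    start: int          # address of the first instruction in the trace
    outcomes: tuple     # taken / not-taken flags of branches inside the trace
    instructions: list  # instruction addresses in dynamic (executed) order


def build_trace(stream):
    """Collect one trace from a non-empty list of (address, is_branch, taken)."""
    instructions, outcomes, blocks = [], [], 0
    pending = None  # outcome of the most recent branch, not yet internal
    for address, is_branch, taken in stream:
        if pending is not None:
            outcomes.append(pending)  # previous branch is now inside the trace
            pending = None
        instructions.append(address)
        if is_branch:
            blocks += 1               # a branch ends the current basic block
            pending = taken
        if blocks == MAX_BLOCKS or len(instructions) == MAX_INSTRUCTIONS:
            break
    return Trace(instructions[0], tuple(outcomes), instructions)


class TraceCache:
    def __init__(self):
        self.lines = {}  # start address -> Trace (no capacity limit in this sketch)

    def fill(self, trace):
        self.lines[trace.start] = trace

    def fetch(self, pc, predicted):
        """Return the cached instructions on a hit, or None on a miss."""
        trace = self.lines.get(pc)
        if trace is None:
            return None
        needed = len(trace.outcomes)
        if tuple(predicted[:needed]) != trace.outcomes:
            return None  # predicted path differs from the cached path
        return trace.instructions


# Hypothetical dynamic stream: two taken branches jump to distant code.
stream = [
    (0x100, False, False), (0x104, True, True),    # block 1, jumps to 0x200
    (0x200, False, False), (0x204, True, False),   # block 2, falls through
    (0x208, False, False), (0x20C, True, True),    # block 3, jumps to 0x300
    (0x300, False, False),
]
cache = TraceCache()
cache.fill(build_trace(stream))
hit = cache.fetch(0x100, [True, False])
miss = cache.fetch(0x100, [False, False])
```

In this sketch, `hit` holds six addresses spanning three noncontiguous regions of code, delivered by a single lookup, while `miss` is `None` because the predicted path diverges from the stored one; a real front end would then fall back to the conventional instruction cache.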