Improving the accuracy of dynamic branch prediction using branch correlation
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Increasing the instruction fetch rate via multiple branch prediction and a branch address cache
ICS '93 Proceedings of the 7th international conference on Supercomputing
Enhancing instruction scheduling with a block-structured ISA
International Journal of Parallel Programming
Next cache line and set prediction
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Optimization of instruction fetch mechanisms for high issue rates
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Dynamic path-based branch correlation
Proceedings of the 28th annual international symposium on Microarchitecture
Multiple-block ahead branch predictors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Increasing the instruction fetch rate via block-structured instruction set architectures
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Path-based next trace prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Alternative fetch and issue policies for the trace cache fetch mechanism
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The PowerPC 604 RISC microprocessor
IEEE Micro
Multiple Branch and Block Prediction
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
The Effects of Mispredicted-Path Execution on Branch Prediction Structures
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Completion time multiple branch prediction for enhancing trace cache performance
Proceedings of the 27th annual international symposium on Computer architecture
Proceedings of the 27th annual international symposium on Computer architecture
PipeRench implementation of the instruction path coprocessor
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Micro-operation cache: a power aware frontend for the variable instruction length ISA
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Power reduction through work reuse
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Performance characterization of a hardware mechanism for dynamic optimization
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Performance Evaluation of Exception Handling in I/O Libraries
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Selecting long atomic traces for high coverage
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Micro-operation cache: a power aware frontend for variable instruction length ISA
IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on low power
A low-complexity fetch architecture for high-performance superscalar processors
ACM Transactions on Architecture and Code Optimization (TACO)
Execution cache-based microarchitecture power-efficient superscalar processors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines
Proceedings of the 32nd annual international symposium on Computer Architecture
The instruction register file micro-architecture
Future Generation Computer Systems - Special issue: Parallel computing technologies
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Block-aware instruction set architecture
ACM Transactions on Architecture and Code Optimization (TACO)
Evaluating trace cache energy efficiency
ACM Transactions on Architecture and Code Optimization (TACO)
ACM Transactions on Computer Systems (TOCS)
The instruction register file micro-architecture
Future Generation Computer Systems - Special issue: Parallel computing technologies
International Journal of Modelling and Simulation
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Do trace cache, value prediction and prefetching improve SMT throughput?
ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
Hi-index | 0.00 |
The trace cache is a recently proposed solution to achieving high instruction fetch bandwidth by buffering and reusing dynamic instruction traces. This work presents a new block-based trace cache implementation that can achieve higher IPC performance with more efficient storage of traces. Instead of explicitly storing instructions of a trace, pointers to blocks constituting a trace are stored in a much smaller trace table. The block-based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache. Traces are constructed by accessing the replicated block cache using block pointers from the trace table. Performance potential of the block-based trace cache is quantified and compared with perfect branch prediction and perfect fetch schemes. Comparing to the conventional trace cache, the block-based design can achieve higher IPC, with less impact on cycle time.Results: Using the SPECint95 benchmarks, a 16-wide realistic design of a block-based trace cache can improve performance 75% over a baseline design and to within 7% of a baseline design with perfect branch prediction. With idealized trace prediction, it is shown the block-based trace cache with an 1K-entry block cache achieves the same performance of the conventional trace cache with 32K entries.