Hardware support for large atomic units in dynamically scheduled machines
MICRO 21 Proceedings of the 21st annual workshop on Microprogramming and microarchitecture
Branch history table prediction of moving target branches due to subroutine returns
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
A comprehensive instruction fetch mechanism for a processor supporting speculative execution
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Improving CISC instruction decoding performance using a fill unit
Proceedings of the 28th annual international symposium on Microarchitecture
Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
DAISY: dynamic compilation for 100% architectural compatibility
Proceedings of the 24th annual international symposium on Computer architecture
Target prediction for indirect jumps
Proceedings of the 24th annual international symposium on Computer architecture
Path-based next trace prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
A Trace Cache Microarchitecture and Evaluation
IEEE Transactions on Computers - Special issue on cache memory and related problems
A scalable front-end architecture for fast instruction delivery
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Dynamo: a transparent dynamic optimization system
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
The impact of delay on the design of branch predictors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Increasing the size of atomic instruction blocks using control flow assertions
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Micro-operation cache: a power aware frontend for the variable instruction length ISA
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
DELI: a new run-time control point
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Dynamic Branch Prediction with Perceptrons
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Power Awareness through Selective Dynamically Optimized Traces
Proceedings of the 31st annual international symposium on Computer architecture
A low-complexity fetch architecture for high-performance superscalar processors
ACM Transactions on Architecture and Code Optimization (TACO)
Proceedings of the 32nd annual international symposium on Computer Architecture
RENO: A Rename-Based Instruction Optimizer
Proceedings of the 32nd annual international symposium on Computer Architecture
Hi-index | 0.00 |
Fast instruction decoding is a challenge for the design of CISC microprocessors. A well-known solution to overcome this problem is using a trace cache. It stores and fetches already decoded instructions, avoiding the need for decoding them again. However, implementing a trace cache involves an important increase in the fetch architecture complexity.In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose buffer like the trace cache, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front-end to fetch already decoded instructions from memory instead of the original nondecoded instructions. Our results show that an 8-wide superscalar processor achieves an average 14% performance improvement by using our decoding architecture. This improvement is comparable to the one achieved by using the more complex trace cache, while requiring 16% less chip area and 21% less energy consumption in the fetch architecture.