Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
As the issue width of superscalar processors increases, instruction fetch bandwidth requirements will also increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. We propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. For the Instruction Benchmark Suite (IBS) and SPEC92 integer benchmarks, a 4-kilobyte trace cache improves performance on average by 28% over conventional sequential fetching. Further, we show that the trace cache's efficient, low-latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.
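
To make the idea of caching traces of the dynamic instruction stream concrete, here is a minimal software sketch of a trace cache lookup. It is an illustration only, not the hardware design evaluated in the paper. The names `Trace` and `ToyTraceCache`, the direct-mapped indexing, the default of 64 lines, and the use of small integers as instruction addresses are all assumptions made for the example. A trace is identified by its starting address and the outcomes of the branches inside it; a lookup hits only when both match the fetch address and the current branch predictions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One cached trace (hypothetical layout, chosen for illustration)."""
    start: int        # address of the first instruction in the trace
    outcomes: tuple   # taken (True) / not taken (False) for each branch in the trace
    instrs: tuple     # instruction addresses in dynamic (executed) order


class ToyTraceCache:
    """Direct-mapped toy model; the indexing scheme is an assumption."""

    def __init__(self, num_lines=64):
        self.num_lines = num_lines
        self.lines = {}  # line index -> Trace

    def _index(self, addr):
        return addr % self.num_lines

    def fill(self, trace):
        # A new trace simply replaces whatever occupied its line.
        self.lines[self._index(trace.start)] = trace

    def lookup(self, fetch_addr, predictions):
        """Return the cached instruction sequence on a hit, else None."""
        line = self.lines.get(self._index(fetch_addr))
        if line is None or line.start != fetch_addr:
            return None  # empty line or tag mismatch
        k = len(line.outcomes)
        if tuple(predictions[:k]) != line.outcomes:
            return None  # predicted path differs from the cached path
        return line.instrs


# Three basic blocks that are not contiguous in memory:
#   block A: 100, 101, 102  (branch at 102, taken, target 200)
#   block B: 200, 201       (branch at 201, not taken)
#   block C: 202, 203
trace = Trace(start=100,
              outcomes=(True, False),
              instrs=(100, 101, 102, 200, 201, 202, 203))

tc = ToyTraceCache()
tc.fill(trace)

hit = tc.lookup(100, (True, False))            # same start, same predicted path
miss_path = tc.lookup(100, (False, False))     # same start, different path
miss_tag = tc.lookup(164, (True, False))       # maps to the same line, different start
```

In this sketch a hit delivers all three basic blocks in a single lookup even though block A and block B sit at unrelated addresses, which is the property the abstract describes. On a miss, a real design would fall back to the conventional instruction cache; that path is not modeled here.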