Filtering Techniques to Improve Trace-Cache Efficiency

Authors:
Roni Rosner;Avi Mendelson;Ronny Ronen
Venue:
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Year:
2001

Citing 0
Cited 16

Selecting long atomic traces for high coverage
ICS '03 Proceedings of the 17th annual international conference on Supercomputing

Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization

Power Awareness through Selective Dynamically Optimized Traces
Proceedings of the 31st annual international symposium on Computer architecture

A low-complexity fetch architecture for high-performance superscalar processors
ACM Transactions on Architecture and Code Optimization (TACO)

Impact of technology scaling on energy aware execution cache-based microarchitectures
Proceedings of the 2004 international symposium on Low power electronics and design

Decode filter cache for energy efficient instruction cache hierarchy in super scalar architectures
Proceedings of the 2004 Asia and South Pacific Design Automation Conference

Execution cache-based microarchitecture power-efficient superscalar processors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines
Proceedings of the 32nd annual international symposium on Computer Architecture

Trace Cache Sampling Filter
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques

Managing bounded code caches in dynamic binary optimization systems
ACM Transactions on Architecture and Code Optimization (TACO)

Trace cache sampling filter
ACM Transactions on Computer Systems (TOCS)

A predictive decode filter cache for reducing power consumption in embedded processors
ACM Transactions on Design Automation of Electronic Systems (TODAES)

Enlarging Instruction Streams
IEEE Transactions on Computers

Multiple stream prediction
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems

Branch target buffer design for embedded processors
Microprocessors & Microsystems

PARROT: power awareness through selective dynamically optimized traces
PACS'03 Proceedings of the Third international conference on Power - Aware Computer Systems

Abstract

The trace cache is becoming an important building block of modern, wide-issue processors. So far, trace-cache research has focused on increasing fetch bandwidth. Trace caches have been shown to effectively increase the number of "useful" instructions that can be fetched into the machine, thus enabling more instructions to be executed each cycle. However, the trace cache has another important benefit that has received less attention in recent research: reducing instruction-decoding power, which is particularly attractive for variable-length ISAs such as Intel's IA-32 architecture (x86). Keeping the instruction traces in decoded format implies that the decoding power is paid only when a trace is built, thus reducing the overall power consumption of the system. This paper has three main contributions: it indicates that trace-cache optimizations directed at reducing power consumption do not necessarily coincide with optimizations directed at increasing fetch bandwidth; it extends our understanding of how well the trace cache utilizes its resources; and it introduces a new trace-cache organization based on filtering techniques. The knowledge obtained from the analysis of the traces' behavioral patterns motivates the use of filtering techniques. The new trace-cache organization increases the effective instruction-fetch bandwidth while also reducing the power consumption of the trace-cache system. We observe that (1) the majority of traces that are inserted into the trace cache are rarely used again before being replaced; (2) the majority of the instructions delivered for execution originate from the few traces that are heavily and repeatedly used; and (3) techniques that aim to improve instruction-fetch bandwidth may increase the number of traces built during program execution. Based on these observations, we propose splitting the trace cache into two components: the filter trace cache (FTC) and the main trace cache (MTC).
Traces are first inserted into the FTC, which is used to filter out the infrequently used traces; traces that prove "useful" are later moved into the MTC itself. The FTC/MTC organization exhibits an important benefit: it decreases the number of traces built, thus reducing power consumption while improving overall performance. For medium-size applications, the FTC/MTC pair reduces the number of trace builds by 16% on average. An extension of the filtering concept involves adding a second-level (L2) trace cache that stores less frequently used traces that are replaced in the FTC or the MTC. The extra level of caching allows for an order-of-magnitude reduction in the number of trace builds. The second-level trace cache proves particularly useful for applications with large instruction footprints.
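
The FTC/MTC idea described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' design: the abstract does not specify associativity, replacement policy, or the criterion by which a trace "proves useful", so the sketch assumes fully associative LRU caches and promotes a trace to the MTC after a configurable number of FTC hits. All names (`LRUCache`, `count_builds_unified`, `count_builds_filtered`, `promote_after`, `example_stream`) are hypothetical, and the L2 trace cache is left out.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU store: trace id -> hit count (assumed organization)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def __contains__(self, trace_id):
        return trace_id in self.entries

    def touch(self, trace_id):
        """Record a hit, mark the entry most recently used, return its hit count."""
        self.entries[trace_id] += 1
        self.entries.move_to_end(trace_id)
        return self.entries[trace_id]

    def insert(self, trace_id, hits=0):
        """Insert a trace, evicting the least recently used entry when full."""
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[trace_id] = hits

    def remove(self, trace_id):
        """Remove a trace and return its hit count."""
        return self.entries.pop(trace_id)


def count_builds_unified(stream, capacity):
    """Baseline: one trace cache; every miss costs a trace build."""
    cache, builds = LRUCache(capacity), 0
    for trace in stream:
        if trace in cache:
            cache.touch(trace)
        else:
            builds += 1
            cache.insert(trace)
    return builds


def count_builds_filtered(stream, ftc_size, mtc_size, promote_after=1):
    """FTC/MTC split: new traces enter the FTC; a trace is moved to the MTC
    after `promote_after` FTC hits (an assumed promotion rule)."""
    ftc, mtc, builds = LRUCache(ftc_size), LRUCache(mtc_size), 0
    for trace in stream:
        if trace in mtc:
            mtc.touch(trace)
        elif trace in ftc:
            if ftc.touch(trace) >= promote_after:
                mtc.insert(trace, ftc.remove(trace))
        else:
            builds += 1
            ftc.insert(trace)
    return builds


def example_stream(rounds, hot=4):
    """Synthetic stream: each hot trace runs twice, followed by a one-shot trace."""
    stream, cold_id = [], 0
    for _ in range(rounds):
        for h in range(hot):
            stream += [("hot", h), ("hot", h), ("cold", cold_id)]
            cold_id += 1
    return stream


if __name__ == "__main__":
    stream = example_stream(rounds=10)
    print("unified builds: ", count_builds_unified(stream, capacity=6))
    print("filtered builds:", count_builds_filtered(stream, ftc_size=2, mtc_size=4))
```

On this synthetic stream, a unified six-entry cache performs 80 trace builds, while the two-entry FTC plus four-entry MTC performs 44, because one-shot traces can no longer evict the heavily reused ones. These figures come from a toy workload chosen to make the effect visible and say nothing about the measurements reported in the paper.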