Selecting long atomic traces for high coverage
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Power Awareness through Selective Dynamically Optimized Traces
Proceedings of the 31st annual international symposium on Computer architecture
A low-complexity fetch architecture for high-performance superscalar processors
ACM Transactions on Architecture and Code Optimization (TACO)
Impact of technology scaling on energy aware execution cache-based microarchitectures
Proceedings of the 2004 international symposium on Low power electronics and design
Decode filter cache for energy efficient instruction cache hierarchy in super scalar architectures
Proceedings of the 2004 Asia and South Pacific Design Automation Conference
Execution cache-based microarchitecture power-efficient superscalar processors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines
Proceedings of the 32nd annual international symposium on Computer Architecture
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Managing bounded code caches in dynamic binary optimization systems
ACM Transactions on Architecture and Code Optimization (TACO)
ACM Transactions on Computer Systems (TOCS)
A predictive decode filter cache for reducing power consumption in embedded processors
ACM Transactions on Design Automation of Electronic Systems (TODAES)
IEEE Transactions on Computers
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Branch target buffer design for embedded processors
Microprocessors & Microsystems
PARROT: power awareness through selective dynamically optimized traces
PACS'03 Proceedings of the Third international conference on Power - Aware Computer Systems
Hi-index | 0.00 |
Abstract: The trace cache is becoming an important building block of modern, wide-issue processors. So far, trace cache related research has been focused on increasing fetch bandwidth. Trace-caches have been shown too effectively increase the number of "useful" instructions that can be fetched into the machine, thus enabling more instructions to be executed each cycle. However, trace cache has another important benefit that got less attention in recent research: especially for variable length ISA, such as Intel's IA-32 architecture (X86), reducing instruction decoding power is particularly attractive. Keeping the instruction traces in decoded format, implies the decoding power is only paid upon the build of a trace, thus reducing the overall power consumption of the system. This paper has three main contributions: it indicates that trace cache optimizations directed to reducing power consumption arc do not necessarily coincide with optimizations directed to increasing fetch bandwidth; it extends our understanding on how well the trace cache utilizes its resources and introduces a new trace-cache organization based on filtering techniques. The knowledge obtained from the analysis of the traces' behavioral patterns motivates the use of filtering techniques. The new trace-cache organization increases the effective instruction-fetch bandwidth in conjunction with reducing the power consumption of the trace-cache system. We observe that (1) the majority of traces that are inserted into the trace-cache are rarely used again before being replaced; (2) the majority of the instructions delivered for execution originate from the fewer traces that are heavily and repeatedly used; and that (3) techniques that aim to improve instruction-fetch bandwidth may increase the number of traces built during program execution. Based on these observations, we propose splitting the trace cache into two components: the filter trace-cache (FTC) and the main trace-cache (MTC). Traces are first inserted into the FTC that is used to filter out the infrequently used traces; traces that prove "useful" are later moved into the MTC itself. The FTC/MTC organization exhibits an important benefit: it decreases the number of traces built, thus reducing power consumption while improving overall performance. For medium-size applications, the FTC/MTC pair reduces the number of trace builds by 16% in average. As extension of the filtering concept that involves adding a second level (L2) trace-cache that stores less frequent traces that are replaced in the FTC or the MTC. The extra level of caching allows for order-of-magnitude reduction in the number of trace builds. Second level trace cache proves particularly useful for applications with large instruction footprints.