Power Awareness through Selective Dynamically Optimized Traces

Authors:
Roni Rosner;Yoav Almog;Micha Moffie;Naftali Schwartz;Avi Mendelson
Affiliations:
Intel Labs, Haifa, Israel;Intel Labs, Haifa, Israel;Intel Labs, Haifa, Israel;Intel Labs, Haifa, Israel;Intel Labs, Haifa, Israel
Venue:
Proceedings of the 31st annual international symposium on Computer architecture
Year:
2004

Citing 24
Cited 21

Limits of control flow on parallelism

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The expandable split window paradigm for exploiting fine-grain parallelsim

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Effective compiler support for predicated execution using the hyperblock

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Enhancing instruction scheduling with a block-structured ISA

International Journal of Parallel Programming
Exploiting instruction level parallelism in processors by caching scheduled groups

Proceedings of the 24th annual international symposium on Computer architecture
DAISY: dynamic compilation for 100% architectural compatibility

Proceedings of the 24th annual international symposium on Computer architecture
Path-based next trace prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
A Trace Cache Microarchitecture and Evaluation

IEEE Transactions on Computers - Special issue on cache memory and related problems
A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
Increasing the size of atomic instruction blocks using control flow assertions

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Micro-operation cache: a power aware frontend for the variable instruction length ISA

ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
rePLay: A Hardware Framework for Dynamic Optimization

IEEE Transactions on Computers
Performance characterization of a hardware mechanism for dynamic optimization

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Dynamic and Transparent Binary Translation

Computer
Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors

IEEE Micro
Filtering Techniques to Improve Trace-Cache Efficiency

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Optimizing pipelines for power and performance

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Selecting long atomic traces for high coverage

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Dynamic Optimization of Micro-Operations

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Performance and Hardware Complexity Tradeoffs in Designing Multithreaded Architectures

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
VLIW Scheduling for Energy and Performance

WVLSI '01 Proceedings of the IEEE Computer Society Workshop on VLSI 2001
Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization

Low-power, low-complexity instruction issue using compiler assistance

Proceedings of the 19th annual international conference on Supercomputing
An Event-Driven Multithreaded Dynamic Optimization Framework

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Continuous Path and Edge Profiling

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Startup Time in Co-Designed Virtual Machines

Proceedings of the 33rd annual international symposium on Computer Architecture
Branch predictor guided instruction decoding

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Wide and efficient trace prediction using the local trace predictor

Proceedings of the 20th annual international conference on Supercomputing
Evaluating trace cache energy efficiency

ACM Transactions on Architecture and Code Optimization (TACO)
Trace cache sampling filter

ACM Transactions on Computer Systems (TOCS)
Interactive presentation: Generating and executing multi-exit custom instructions for an adaptive extensible processor

Proceedings of the conference on Design, automation and test in Europe
Enlarging Instruction Streams

IEEE Transactions on Computers
Predictor virtualization

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
An architecture framework for an adaptive extensible processor

The Journal of Supercomputing
TAO: two-level atomicity for dynamic binary optimizations

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
SoftHV: a HW/SW co-designed processor with horizontal and vertical fusion

Proceedings of the 8th ACM International Conference on Computing Frontiers
Trace-Based runtime instruction rescheduling for architecture extension

ICESS'05 Proceedings of the Second international conference on Embedded Software and Systems
DDGacc: boosting dynamic DDG-based binary optimizations through specialized hardware support

VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
LAR-CC: Large atomic regions with conditional commits

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
BlockChop: dynamic squash elimination for hybrid processor architecture

Proceedings of the 39th Annual International Symposium on Computer Architecture
TSO_ATOMICITY: efficient hardware primitive for TSO-preserving region optimizations

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Speculative hardware/software co-designed floating-point multiply-add fusion

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present the PARROT concept that seeks to achievehigher performance with reduced energy consumptionthrough gradual optimization of frequently executed codetraces. The PARROT microarchitectural framework integratestrace caching, dynamic optimizations and pipelinedecoupling. We employ a selective approach for applyingcomplex mechanisms only upon the most frequently usedtraces to maximize the performance gain at any givenpower constraint, thus attaining finer control of tradeoffsbetween performance and power awareness.We show that the PARROT based microarchitecture canimprove the performance of aggressively designed processorsby providing the means to improve the utilizationof their more elaborate resources. At the same time, rigorousselection of traces prior to storage and optimizationprovides the key to attenuating increases in thepower budget.For resource-constrained designs, PARROT based architecturesdeliver better performance (up to an average16% increase in IPC) at a comparable energy level,whereas the conventional path to a similar performanceimprovement consumes an average 70% more energy.Meanwhile, for those designs which can tolerate a higherpower budget, PARROT gracefully scales up to use additionalexecution resources in a uniformly efficient manner.In particular, a PARROT-style doubly-wide machinedelivers an average 45% IPC improvement while actuallyimproving the cubic-MIPS-per-WATT power awarenessmetric by over 50%.