Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure

Authors:
Oguz Ergin;Deniz Balkan;Kanad Ghose;Dmitry Ponomarev
Affiliations:
Intel Labs, UPC, Barcelona, Spain;State University of New York, Binghamton, NY;State University of New York, Binghamton, NY;State University of New York, Binghamton, NY
Venue:
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2004

Citing 37
Cited 23

Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Accurate static branch prediction by value range propagation

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Register renaming and dynamic speculation: an alternative approach

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Exploiting dead value information

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A novel renaming scheme to exploit value temporal locality through physical register reuse and unification

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Delaying physical register allocation through virtual-physical registers

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Table size reduction for data value predictors by exploiting narrow width values

Proceedings of the 14th international conference on Supercomputing
Multiple-banked register file architectures

Proceedings of the 27th annual international symposium on Computer architecture
Bidwidth analysis with application to silicon compilation

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Exploiting superword level parallelism with multimedia instruction sets

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Very low power pipelines using significance compression

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Dynamic zero compression for cache energy reduction

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
A system-level energy minimization approach using datapath width optimization

ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Reducing the complexity of the register file in dynamic superscalar processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Bitwidth aware global register allocation

POPL '03 Proceedings of the 30th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
The Alpha 21264 Microprocessor

IEEE Micro
Cherry: checkpointed early resource recycling in out-of-order microprocessors

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Reducing register ports for higher speed and lower energy

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Exploiting data-width locality to increase superscalar execution bandwidth

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Reducing register ports using delayed write-back queues and operand pre-fetch

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Virtual-Physical Registers

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Hardware Schemes for Early Register Release

ICPP '02 Proceedings of the 2002 International Conference on Parallel Processing
Profile-Based Dynamic Voltage Scheduling Using Program Checkpoints

Proceedings of the conference on Design, automation and test in Europe
Loose Loops Sink Chips

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
In-Line Interrupt Handling for Software-Managed TLBs

ICCD '01 Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors
A Scalable Register File Architecture for Dynamically Scheduled Processors

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Energy Efficient Asymmetrically Ported Register Files

ICCD '03 Proceedings of the 21st International Conference on Computer Design
Exploiting Value Locality in Physical Register Files

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Software-Controlled Operand-Gating

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Use-Based Register Caching with Decoupled Indexing

Proceedings of the 31st annual international symposium on Computer architecture
Physical Register Inlining

Proceedings of the 31st annual international symposium on Computer architecture
Defining Wakeup Width for Efficient Dynamic Scheduling

ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Increasing Processor Performance Through Early Register Release

ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Dynamically reducing pressure on the physical register file through simple register sharing

ISPASS '04 Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software

Compiler Directed Early Register Release

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
SPARTAN: speculative avoidance of register allocations to transient values for performance and energy efficiency

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Early Register Deallocation Mechanisms Using Checkpointed Register Files

IEEE Transactions on Computers
A case for a complexity-effective, width-partitioned microarchitecture

ACM Transactions on Architecture and Code Optimization (TACO)
Selective writeback: exploiting transient values for energy-efficiency and performance

Proceedings of the 2006 international symposium on Low power electronics and design
Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping

Proceedings of the International Symposium on Code Generation and Optimization
Predicting and Exploiting Transient Values for Reducing Register File Pressure and Energy Consumption

IEEE Transactions on Computers
Asymmetrically banked value-aware register files for low-energy and high-performance

Microprocessors & Microsystems
Selective writeback: reducing register file pressure and energy consumption

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Checkpoint allocation and release

ACM Transactions on Architecture and Code Optimization (TACO)
Exploring the limits of early register release: Exploiting compiler analysis

ACM Transactions on Architecture and Code Optimization (TACO)
Energy-efficient register caching with compiler assistance

ACM Transactions on Architecture and Code Optimization (TACO)
Exploiting narrow-width values for thermal-aware register file designs

Proceedings of the Conference on Design, Automation and Test in Europe
Characterization and exploitation of narrow-width loads: the narrow-width cache approach

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Empowering a helper cluster through data-width aware instruction selection policies

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
On the exploitation of narrow-width values for improving register file reliability

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Global productiveness propagation: a code optimization technique to speculatively prune useless narrow computations

Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Exploiting narrow values for energy efficiency in the register files of superscalar microprocessors

PATMOS'06 Proceedings of the 16th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
Efficient and effective buffer overflow protection on ARM processors

WISTP'10 Proceedings of the 4th IFIP WG 11.2 international conference on Information Security Theory and Practices: security and Privacy of Pervasive Systems and Smart Devices
Exploiting narrow-width values for process variation-tolerant 3-D microprocessors

Proceedings of the 49th Annual Design Automation Conference
Improved bitwidth-aware variable packing

ACM Transactions on Architecture and Code Optimization (TACO)
Shared-port register file architecture for low-energy VLIW processors

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.01

Visualization

Abstract

A large percentage of computed results have fewer significant bits compared to the full width of a register. We exploit this fact to pack multiple results into a single physical register to reduce the pressure on the register file in a superscalar processor. Two schemes for dynamically packing multiple "narrow-width" results into partitions within a single register are evaluated. The first scheme is conservative and allocates a full-width register for a computed result. If the computed result turns out to be narrow, the result is reallocated to partitions within a common register, freeing up the full-width register. The second scheme allocates register partitions based on a prediction of the width of the result and reallocates register partitions when the actual result width is higher than what was predicted. If the actual width is narrower than what was predicted, allocated partitions are freed up. A detailed evaluation of our schemes show that average IPC gains of up to 15% can be realized across the SPEC 2000 benchmarks on a somewhat register-constrained datapath.