Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance

Authors:
David Brooks;Margaret Martonosi
Venue:
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Year:
1999


Abstract

In general-purpose microprocessors, recent trends have pushed towards 64-bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer programs, however, rarely need the full 64-bit dynamic range these CPUs provide. In fact, another recent instruction set trend has been increased support for sub-word operations (that is, manipulating data in quantities smaller than the full word size). In particular, most major processor families have introduced "multimedia" instruction set extensions that operate in parallel on several sub-word quantities in the same ALU.

This paper notes that across the SPECint95 benchmarks, over half of the integer operation executions require 16 bits or less. With this as motivation, our work proposes two hardware optimizations that dynamically recognize and capitalize on these "narrow-bitwidth" instances. Both require little additional hardware, and neither requires compiler support.

The first, power-oriented, optimization reduces processor power consumption by using aggressive clock gating to turn off the portions of integer arithmetic units that are unnecessary for narrow-bitwidth operations. This optimization reduces the integer unit's power consumption by over 50% for the SPECint95 and MediaBench benchmark suites. The second optimization improves performance by merging narrow integer operations and allowing them to share a single functional unit. Conceptually akin to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench.
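The run-time check that both optimizations rely on can be illustrated in software. The following is a minimal sketch, not the paper's hardware design: it assumes that an operand counts as "narrow" when the upper 48 bits of its 64-bit word are all zeros or all ones, and the names `is_narrow`, `both_narrow`, and `narrow_fraction` are hypothetical helpers introduced only for this illustration.

```python
MASK64 = (1 << 64) - 1   # model a 64-bit machine word
NARROW_BITS = 16         # width threshold mentioned in the abstract


def is_narrow(value, bits=NARROW_BITS):
    """Return True if the upper (64 - bits) bits of the word are all 0s or all 1s.

    Assumption for this sketch: this is what "requires 16 bits or less" means.
    """
    word = value & MASK64                 # wrap Python ints to 64 bits
    upper = word >> bits                  # the high-order 64 - bits bits
    all_zeros = upper == 0
    all_ones = upper == (1 << (64 - bits)) - 1
    return all_zeros or all_ones


def both_narrow(a, b, bits=NARROW_BITS):
    """An operation is a candidate for gating or merging only if both operands are narrow."""
    return is_narrow(a, bits) and is_narrow(b, bits)


def narrow_fraction(operand_pairs, bits=NARROW_BITS):
    """Fraction of (a, b) operand pairs in a trace whose operands are both narrow."""
    pairs = list(operand_pairs)
    if not pairs:
        return 0.0
    return sum(both_narrow(a, b, bits) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical four-operation trace; two of the pairs are fully narrow.
    trace = [(1, 2), (70000, 1), (3, -4), (1 << 40, 1 << 40)]
    print(narrow_fraction(trace))
```

In hardware, the same test would be a zero (or ones) detect on the high-order bits of each operand; the software version here only shows which operand pairs would qualify.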