Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance

Authors:
David Brooks; Margaret Martonosi
Affiliations:
Princeton Univ., Princeton, NJ; Princeton Univ., Princeton, NJ
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2000

Abstract

The large address space needs of many current applications have pushed processor designs toward 64-bit word widths. Although full 64-bit addresses and operations are indeed sometimes needed, arithmetic operations on much smaller quantities are still more common. In fact, another instruction set trend has been the introduction of instructions geared toward subword operations on 16-bit quantities. For example, most major processors now include instruction set support for multimedia operations, allowing parallel execution of several subword operations in the same ALU. This article presents our observations demonstrating that operations on “narrow-width” quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or fewer. Based on these data, we propose two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations. The first, a power-oriented optimization, reduces processor power consumption by using operand-value-based clock gating to turn off portions of arithmetic units that will be unused by narrow-width operations. This optimization results in a 45%-60% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. Applying this optimization to the SPECfp95 benchmarks results in slightly smaller power reductions, but still seems warranted. These reductions in integer unit power consumption equate to a 5%-10% full-chip power savings. The second, a performance-oriented optimization, improves processor performance by packing together narrow-width operations so that they share a single arithmetic unit. Conceptually similar to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench.
Overall, these optimizations highlight an increasing opportunity for value-based optimizations to improve both power and performance in current microprocessors.
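
The two mechanisms named in the abstract can be illustrated with a minimal software sketch. It is not the article's hardware design. It assumes that "narrow" means all bits above the low 16 bits of a 64-bit word are zero, so negative values do not count as narrow here. The function names (`is_narrow`, `narrow_fraction`, `packed_add`), the 32-bit lane layout, and the sample trace are all hypothetical and chosen only for illustration.

```python
WORD_BITS = 64
MASK64 = (1 << WORD_BITS) - 1


def is_narrow(word: int, width: int = 16) -> bool:
    """True if every bit above the low `width` bits of a 64-bit word is zero.

    Assumption of this sketch: only zero upper bits count as narrow.
    """
    return ((word & MASK64) >> width) == 0


def narrow_fraction(ops, width: int = 16) -> float:
    """Fraction of (a, b) operand pairs where both operands are narrow.

    These are the operations for which the upper part of an arithmetic
    unit could, in principle, be clock gated.
    """
    ops = list(ops)
    if not ops:
        return 0.0
    narrow = sum(1 for a, b in ops if is_narrow(a, width) and is_narrow(b, width))
    return narrow / len(ops)


def packed_add(pair0, pair1):
    """Perform two narrow additions with one 64-bit add, one per 32-bit lane.

    Each 16-bit sum is at most 2 * 65535, which fits in a 32-bit lane,
    so no carry crosses from the low lane into the high lane.
    """
    (a0, b0), (a1, b1) = pair0, pair1
    if not all(is_narrow(v) for v in (a0, b0, a1, b1)):
        raise ValueError("packing requires all four operands to be narrow")
    total = ((a0 | (a1 << 32)) + (b0 | (b1 << 32))) & MASK64
    return total & 0xFFFFFFFF, total >> 32


# Hypothetical operand trace: three of the four pairs are narrow.
trace = [(3, 4), (65535, 1), (1 << 40, 2), (100, 200)]
print(narrow_fraction(trace))          # 0.75
print(packed_add((3, 4), (65535, 1)))  # (7, 65536)
```

The check in `is_narrow` stands in for the dynamic recognition step, and `packed_add` shows why two narrow operations can share a single wide adder without interfering with each other.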