Instruction issue logic for high-performance, interruptable pipelined processors
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
A high-performance microarchitecture with hardware-programmable functional units
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Precomputation-based sequential logic optimization for low power
IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special issue on low-power design
Circuit implementation of a 300-MHz 64-bit second-generation CMOS Alpha CPU
Digital Technical Journal - Special 10th anniversary issue
Microparallelism and high-performance protein matching
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Performance evaluation of the PowerPC 620 microarchitecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Alpha implementations and architecture: complete reference and guide
Alpha implementations and architecture: complete reference and guide
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Low power data processing by elimination of redundant computations
ISLPED '97 Proceedings of the 1997 international symposium on Low power electronics and design
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Power considerations in the design of the Alpha 21264 microprocessor
DAC '98 Proceedings of the 35th annual Design Automation Conference
Reducing power in high-performance microprocessors
DAC '98 Proceedings of the 35th annual Design Automation Conference
IEEE Transactions on Computers
Wattch: a framework for architectural-level power analysis and optimizations
Proceedings of the 27th annual international symposium on Computer architecture
Computer Architecture; A Quantitative Approach
Computer Architecture; A Quantitative Approach
The IA-64 Architecture at Work
Computer
IEEE Micro
VIS Speeds New Media Processing
IEEE Micro
Subword Parallelism with MAX-2
IEEE Micro
Power-Delay Characteristics of CMOS Multipliers
ARITH '97 Proceedings of the 13th Symposium on Computer Arithmetic (ARITH '97)
Advanced performance features of the 64-bit PA-8000
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
Thermal Management System for High Performance PowerPCTM Microprocessors
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation
Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation
Transistor sizing for low power CMOS circuits
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Guarded evaluation: pushing power management to logic synthesis/design
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Orion: a power-performance simulator for interconnection networks
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Managing static leakage energy in microprocessor functional units
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Exploiting data-width locality to increase superscalar execution bandwidth
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Clock and Power Gating with Timing Closure
IEEE Design & Test
Deterministic Clock Gating for Microprocessor Power Reduction
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Proceedings of the conference on Design, automation and test in Europe: Proceedings
A trace-based framework for verifiable GALS composition of IPs
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Enabling power-efficient DVFS operations on silicon
IEEE Circuits and Systems Magazine
Characterization and exploitation of narrow-width loads: the narrow-width cache approach
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Reducing functional unit power consumption and its variation using leakage sensors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
DCG: deterministic clock-gating for low-power microprocessor design
IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on the 2002 international symposium on low-power electronics and design (ISLPED)
Unified gated flip-flops for reducing the clocking power in register circuits
PATMOS'11 Proceedings of the 21st international conference on Integrated circuit and system design: power and timing modeling, optimization, and simulation
Power reduction of superscalar processor functional units by resizing adder-width
PATMOS'05 Proceedings of the 15th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
A technique to reduce static and dynamic power of functional units in high-performance processors
PATMOS'06 Proceedings of the 16th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
Energy reduction by systematic run-time reconfigurable hardware deactivation
Transactions on High-Performance Embedded Architectures and Compilers IV
Hi-index | 0.00 |
The large address space needs of many current applications have pushed processor designs toward 64-bit word widths. Although full 64-bit addresses and operations are indeed sometimes needed, arithmetic operations on much smaller quantities are still more common. In fact, another instruction set trend has been the introduction of instructions geared toward subword operations on 16-bit quantities. For examples, most major processors now include instruction set support for multimedia operations allowing parallel execution of several subword operations in the same ALU. This article presents our observations demonstrating that operations on “narrow-width” quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or less. Based on this data, we propose two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations. The first, power-oriented optimization reduces processor power consumption by using operand-value-based clock gating to turn off portions of arithmetic units that will be unused by narrow-width operations. This optimization results in a 45%-60% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. Applying this optimization to SPECfp95 benchmarks results in slightly smaller power reductions, but still seems warranted. These reductions in integer unit power consumption equate to a 5%-10% full-chip power savings. Our second, performance-oriented optimization improves processor performance by packing together narrow-width operations so that they share a single arithmetic unit. Conceptually similar to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench. Overall, these optimizations highlight an increasing opportunity for value-based optimizations to improve both power and performance in current microprocessors.