Energy-Efficient Floating-Point Unit Design

Authors:
Sameh Galal;Mark Horowitz
Affiliations:
Stanford University, Stanford;Stanford University, Stanford
Venue:
IEEE Transactions on Computers
Year:
2011

Citing 0
Cited 7

Energy-efficient mechanisms for managing thread context in throughput processors

Proceedings of the 38th annual international symposium on Computer architecture
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

ACM Transactions on Computer Systems (TOCS)
Neural Acceleration for General-Purpose Approximate Programs

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
APOGEE: adaptive prefetching on GPUs for energy efficiency

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface

ACM Transactions on Architecture and Code Optimization (TACO)
EVA: an efficient vision architecture for mobile systems

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Quantified Score

Hi-index	14.98

Visualization

Abstract

Energy-efficient computation is critical if we are going to continue to scale performance in power-limited systems. For floating-point applications that have large amounts of data parallelism, one should optimize the {\rm throughput/mm}^{2} given a power density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. Looking at FP multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 {\rm W/mm}^{2}, one can achieve a performance of {\rm 27 GFlops/mm}^{2} single precision, and {\rm 7.5 GFlops/mm}^{2} double precision. Adding register file overheads reduces the throughput by less than 50 percent if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, to maintain constant power density with scaling requires moving the overall FP architecture to a lower energy/performance point. A 1 {\rm W}/{\rm mm}^{2} design at 90 nm is a “high-energy” design, so scaling it to a lower energy design in 45 nm still yields a 7\times performance gain, while a more balanced 0.1 {\rm W/mm}^{2} design only speeds up by 3.5{\times} when scaled to 45 nm. Performance scaling below 45 nm rapidly decreases, with a projected improvement of only {\sim} 3{\times} for both power densities when scaling to a 22 nm technology.