Ultra-low-power adder stage design for exascale floating point units

Authors:
Alberto A. Del Barrio;Nader Bagherzadeh;Román Hermida
Affiliations:
Complutense University of Madrid;Center for Pervasive Communication and Computing, University of California at Irvine;Complutense University of Madrid
Venue:
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Year:
2014

Abstract

Currently, the most powerful supercomputers deliver tens of petaflops, and future many-core systems are expected to reach an exaflop. However, power budget limitations still make such machines infeasible and unaffordable. Floating Point Units (FPUs) are critical to both the power consumption and the performance of today's microprocessors and supercomputers. The literature offers very different designs: some focus on increasing performance regardless of the penalty, while others reduce power at the expense of performance. In this article, we propose a novel approach for reducing the power of the FPU without degrading the other design parameters. Concretely, the power reduction is accompanied by an area reduction and a performance improvement, which yields an overall energy gain. According to our experiments, the proposed unit consumes 17.5%, 23% and 16.5% less energy for single, double and quadruple precision, respectively, with an additional 15%, 21.5% and 14.5% delay reduction. Furthermore, area is reduced by 4%, 4.5% and 5%.