Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques

Authors:
Zhen Luo; Margaret Martonosi
Affiliations:
Princeton Univ., Princeton, NJ; Princeton Univ., Princeton, NJ
Venue:
IEEE Transactions on Computers
Year:
2000

Abstract

The speed of arithmetic calculations in configurable hardware is limited by carry propagation, even with the dedicated carry hardware found in recent FPGAs. This paper proposes and evaluates an approach called delayed addition that reduces the carry-propagation bottleneck and improves the performance of arithmetic calculations. Borrowing the idea used in Wallace trees, our approach stores results in an intermediate form and delays the final addition until the end of a repeated calculation such as an accumulation or a dot product; this effectively removes carry-propagation overhead from the calculation's critical path. We present both integer and floating-point designs that use our technique. Our pipelined integer multiply-accumulate (MAC) design combines a fairly traditional multiplier design with delayed addition. This design achieves a 72 MHz clock rate on an XC4036xla-9 FPGA and a 170 MHz clock rate on an XCV300epq240-8 FPGA. Next, we present a 32-bit floating-point accumulator based on delayed addition. Here, delayed addition requires a novel alignment technique that decouples the incoming operands from the accumulated result. A conservative version of this design achieves a 40 MHz clock rate on an XC4036xla-9 FPGA and a 97 MHz clock rate on an XCV100epq240-8 FPGA. We also present a 32-bit floating-point accumulator design with compiler-managed overflow avoidance that achieves an 80 MHz clock rate on an XC4036xla-9 FPGA and a 150 MHz clock rate on an XCV100epq240-8 FPGA.
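The delayed-addition idea for the integer case can be illustrated with a small software sketch. It assumes that the "intermediate form" is a carry-save pair (a sum word and a carry word) updated by a 3:2 compressor, as in a Wallace tree; the function names and the 32-bit width are hypothetical choices for illustration, and the sketch does not model the paper's pipelining, multiplier, or floating-point alignment logic.

```python
def carry_save_step(s, c, x, width=32):
    """One 3:2 compression step: fold operand x into the (sum, carry) pair.

    Each output bit depends only on the three input bits at that position
    (or the position below it), so no carry ripples across the word.
    """
    mask = (1 << width) - 1
    new_s = s ^ c ^ x                            # bitwise sum without carries
    new_c = ((s & c) | (s & x) | (c & x)) << 1   # majority bits, shifted up
    return new_s & mask, new_c & mask


def delayed_accumulate(values, width=32):
    """Accumulate values modulo 2**width, deferring the carry-propagating add.

    Assumption: wrap-around (two's complement) arithmetic, as in a
    fixed-width hardware accumulator.
    """
    mask = (1 << width) - 1
    s, c = 0, 0
    for x in values:
        s, c = carry_save_step(s, c, x & mask, width)
    # The only full carry-propagating addition happens once, at the end.
    return (s + c) & mask
```

In hardware, the loop body corresponds to one row of full adders per clock cycle, which is why the per-iteration delay no longer grows with the word width; only the final `s + c` needs a conventional adder.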