FPGA implementation of an exact dot product and its application in variable-precision floating-point arithmetic

Authors:
Yuanwu Lei;Yong Dou;Yazhuo Dong;Jie Zhou;Fei Xia
Affiliations:
National Laboratory for Parallel & Distributed Processing, NUDT, Changsha, China;National Laboratory for Parallel & Distributed Processing, NUDT, Changsha, China;National Laboratory for Parallel & Distributed Processing, NUDT, Changsha, China;National Laboratory for Parallel & Distributed Processing, NUDT, Changsha, China;National Laboratory for Parallel & Distributed Processing, NUDT, Changsha, China
Venue:
The Journal of Supercomputing
Year:
2013



Abstract

This paper explores the capability and flexibility of field-programmable gate arrays (FPGAs) for implementing variable-precision floating-point (VP) arithmetic. First, the VP exact dot product algorithm, which uses exact fixed-point operations to obtain an exact result, is presented. A VP multiplication and accumulation unit (VPMAC) on FPGA is then proposed. In the proposed design, mantissa multiplication, the most time-consuming part of the VP multiplication and accumulation operation, is handled by parallel multipliers that generate its partial products simultaneously. This method makes full use of the DSP resources on FPGAs to improve the performance of the VPMAC unit. Several other schemes, such as a two-level RAM bank, carry-save accumulation, and partial summation, are used to achieve high frequency and pipeline throughput in the product accumulation stage. Typical Basic Linear Algebra Subprograms (BLAS) routines (vector dot product, general matrix-vector product, and general matrix-matrix product), LU decomposition, and modified Gram-Schmidt QR decomposition are used to evaluate the performance of the VPMAC unit. Two schemes, called the VPMAC coprocessor and the matrix accelerator, are presented to implement these applications. Finally, prototypes of the VPMAC unit and of the matrix accelerator based on the VPMAC unit are built on a Xilinx XC6VLX760 FPGA chip. Compared with a parallel software implementation based on OpenMP running on an Intel Xeon quad-core E5620 CPU, the VPMAC coprocessor, equipped with one VPMAC unit, achieves a maximum acceleration factor of 18X. Moreover, the matrix accelerator, which mainly consists of a linear array of eight processing elements, achieves 12X to 65X better performance.
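The idea of an exact dot product built on exact fixed-point operations can be illustrated in software. The following is a minimal sketch, not the VPMAC design described in the paper: it assumes IEEE 754 double-precision inputs rather than variable-precision operands, and it uses Python's arbitrary-precision integers as a stand-in for a long fixed-point accumulator. The names `FRAC_BITS`, `exact_dot`, and `dot_rounded_once` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

# Assumption for this sketch: inputs are finite IEEE 754 doubles.
# The smallest positive double is 2**-1074, so every product of two
# doubles is an integer multiple of 2**-2148.
FRAC_BITS = 2 * 1074


def exact_dot(xs, ys):
    """Accumulate sum(x*y) exactly in a long fixed-point accumulator.

    The accumulator is a Python int whose value is acc * 2**-FRAC_BITS.
    No rounding happens anywhere in the loop.
    """
    acc = 0
    for x, y in zip(xs, ys):
        # Each finite double equals n / d exactly, with d a power of two.
        nx, dx = x.as_integer_ratio()
        ny, dy = y.as_integer_ratio()
        # dx * dy divides 2**FRAC_BITS, so this integer division is exact.
        acc += (nx * ny) * ((1 << FRAC_BITS) // (dx * dy))
    return Fraction(acc, 1 << FRAC_BITS)


def dot_rounded_once(xs, ys):
    """Round the exact result to a double a single time, at the end."""
    return float(exact_dot(xs, ys))
```

For example, with `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, left-to-right accumulation in double precision loses the middle term, whereas the exact accumulator preserves it until the final rounding. A hardware unit would replace the unbounded integer with a fixed-width accumulator register or RAM bank.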