This paper explores the capability and flexibility of field-programmable gate arrays (FPGAs) for implementing variable-precision (VP) floating-point arithmetic. First, the VP exact dot product algorithm, which uses exact fixed-point operations to obtain an exact result, is presented. A VP multiply-and-accumulate unit (VPMAC) on FPGA is then proposed. In the proposed design, parallel multipliers generate the partial products of the mantissa multiplication, the most time-consuming part of the VP multiply-and-accumulate operation, in parallel. This approach fully exploits the DSP resources on FPGAs to enhance the performance of the VPMAC unit. Several further schemes, such as a two-level RAM bank, carry-save accumulation, and partial summation, are used to achieve high frequency and pipeline throughput in the product-accumulation stage. Typical algorithms from the Basic Linear Algebra Subprograms (vector dot product, general matrix-vector product, and general matrix-matrix product), together with LU decomposition and Modified Gram-Schmidt QR decomposition, are used to evaluate the performance of the VPMAC unit. Two schemes, a VPMAC coprocessor and a matrix accelerator, are presented to implement these applications. Finally, prototypes of the VPMAC unit and of the matrix accelerator built around it are created on a Xilinx XC6VLX760 FPGA chip. Compared with a parallel software implementation based on OpenMP running on an Intel Xeon quad-core E5620 CPU, the VPMAC coprocessor, equipped with one VPMAC unit, achieves a maximum speedup of 18X. Moreover, the matrix accelerator, which consists mainly of a linear array of eight processing elements, achieves 12X to 65X better performance.
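The exact dot product idea mentioned above can be illustrated in software. The sketch below is a minimal Kulisch-style long-accumulator model, not the paper's hardware design: each double-precision product is converted to an exact scaled integer and summed without rounding, so only the final conversion back to a float rounds. The accumulator width (`frac_bits`) is an assumption chosen to cover the full exponent range of products of IEEE-754 doubles.

```python
import math

def exact_dot(xs, ys, frac_bits=2300):
    """Exact dot product via a wide fixed-point accumulator.

    Every product of two doubles is represented exactly as an integer
    scaled by 2**frac_bits; frac_bits=2300 covers even products of two
    subnormals (exponents down to about 2**-2252). Only the final
    division rounds, so the result is the correctly rounded dot product.
    """
    acc = 0  # arbitrary-precision integer accumulator
    for x, y in zip(xs, ys):
        mx, ex = math.frexp(x)      # x = mx * 2**ex with 0.5 <= |mx| < 1
        my, ey = math.frexp(y)
        ix = int(mx * (1 << 53))    # 53-bit integer significands (exact)
        iy = int(my * (1 << 53))
        e = ex + ey - 106           # binary exponent of the product ix*iy
        shift = e + frac_bits
        assert shift >= 0, "accumulator too narrow for this exponent"
        acc += (ix * iy) << shift   # exact accumulation, no rounding
    return acc / (1 << frac_bits)   # single rounding at the very end
```

A naive left-to-right floating-point sum of `[1e16, 1.0, -1e16]` dotted with ones loses the middle term to rounding and returns 0.0; the exact accumulator returns 1.0. The hardware analogue replaces the big-integer shift-and-add with a wide carry-save accumulator, which is one of the schemes the paper's VPMAC unit employs.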