FPGA accelerating double/quad-double high precision floating-point applications for ExaScale computing

  • Authors:
  • Yong Dou;Yuanwu Lei;Guiming Wu;Song Guo;Jie Zhou;Li Shen

  • Affiliations:
  • NUDT, Changsha, P.R. China (all authors)

  • Venue:
  • Proceedings of the 24th ACM International Conference on Supercomputing
  • Year:
  • 2010

Abstract

In this paper we explore the capability and flexibility of FPGA solutions for accelerating scientific computing applications that require very high precision arithmetic, based on 128-bit or even 256-bit floating-point number representations. This paper addresses the accuracy of LU decomposition on large-scale matrices. In future ExaScale computing environments, accumulated rounding errors are expected to grow to a level that leaves only 11 significant bits in the mantissa. This is caused by the large number of accumulation operations required, which is on the order of O(n3). Using exact long fixed-point numbers instead of the usual floating-point numbers in the accumulation process yields exact accumulation results with at most one bit of error, introduced by the rounding in the final normalization step. We have developed two types of High Precision Multiplication and Accumulation (HP-MAC) units, for Double-Double (128-bit) and Quad-Double (256-bit) floating-point arithmetic, respectively, and implemented them on FPGA devices. We propose a two-level RAM-bank scheme to store and add long fixed-point numbers with minimized critical data path lengths. We also introduce a partial-summation scheme that enhances the pipeline throughput of MAC operations by dividing the summation into 4 partial operations processed in 4 banks. To prove the concept, we prototyped six 128-bit HP-MAC units on a Xilinx Virtex-5 XC5VLX330 FPGA chip and performed LU decomposition. The experimental results show an accuracy improvement of 10 to 24 bits compared to a software approach with similar-precision arithmetic. Moreover, our FPGA-based LU decomposition, running at 133MHz, achieves 29X--56X better performance and much lower power consumption than a software-based library running on an Intel Core2 Quad Q8200 CPU at 2.33GHz.
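The core idea behind the HP-MAC units — accumulating in an exact long fixed-point format so that rounding happens only once, at the final normalization — can be illustrated in software. The following Python sketch is not the paper's hardware design (the names `SCALE_BITS`, `to_fixed`, and `exact_sum` are illustrative); it exploits the fact that every finite IEEE-754 double is an integer multiple of 2^-1074, so scaling by 2^1074 maps each operand to an integer exactly and the accumulation itself is error-free:

```python
# Sketch of exact accumulation in a long fixed-point (Kulisch-style)
# accumulator. Any finite IEEE-754 double equals p / q with q a power of
# two no larger than 2**1074, so scaling by 2**1074 converts every double
# to an integer exactly; integer sums then incur no rounding at all.

SCALE_BITS = 1074          # weight of the accumulator's least significant bit
SCALE = 1 << SCALE_BITS

def to_fixed(x: float) -> int:
    """Convert a finite double to the long fixed-point format, exactly."""
    p, q = x.as_integer_ratio()    # q is a power of two dividing 2**1074
    return p * (SCALE // q)

def exact_sum(values) -> float:
    """Accumulate exactly in fixed point; round only once at the end."""
    acc = sum(to_fixed(v) for v in values)
    return acc / SCALE             # one correctly-rounded conversion back

# A case where naive floating-point accumulation loses the small addend:
data = [1e16, 1.0, -1e16]
naive = 0.0
for v in data:
    naive += v                     # 1e16 + 1.0 rounds back to 1e16
print(naive)            # 0.0 -- the 1.0 was absorbed
print(exact_sum(data))  # 1.0 -- exact accumulation preserves it
```

A hardware accumulator of this width is impractical as a single adder, which is why the paper banks the long fixed-point word across RAMs and splits the summation into partial operations to keep the critical path short.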