Portable and scalable FPGA-based acceleration of a direct linear system solver

Authors:
Wei Zhang; Vaughn Betz; Jonathan Rose
Affiliations:
University of Toronto, ON, Canada; University of Toronto, ON, Canada; University of Toronto, ON, Canada
Venue:
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Year:
2012

Abstract

FPGAs have the potential to serve as a platform for accelerating many computations, including scientific applications. However, the high development cost and short life span of FPGA designs have limited their adoption by the scientific computing community. FPGA-based scientific computing, and many kinds of embedded computing, could become more practical if there were hardware libraries that were portable to any FPGA-based system and whose performance scaled with the size of the FPGA. To illustrate this idea, we have implemented one common supercomputing library function: the LU factorization method for solving systems of linear equations. This paper describes a method for making the design both portable and scalable, which should be instructive if such libraries are to be built in the future. The design is a software-based generator that leverages both the flexibility of a software programming language and the parameterization inherent in a hardware description language. The generator accepts parameters that describe the FPGA capacity and external memory capabilities. We compare the performance of our engine executing on the largest FPGA available at the time of this work (an Altera Stratix III 3S340) to a single processor core fabricated in the same 65 nm IC process running a highly optimized software implementation from the processor vendor. For single-precision matrices on the order of 10,000 × 10,000 elements, the FPGA implementation is 2.2 times faster and the energy dissipated per useful GFLOP is a factor of 5 lower. For double precision, the FPGA implementation is 1.7 times faster and 3.5 times more energy efficient.
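
For readers unfamiliar with the library function named in the abstract, the following is a minimal textbook sketch of LU factorization with partial pivoting, followed by the two triangular solves. It is not the authors' hardware design: the abstract does not state which pivoting or blocking strategy the engine uses, so the choice of partial pivoting and the function names `lu_factor` and `lu_solve` are assumptions made purely for illustration.

```python
def lu_factor(a):
    """Return (lu, perm) such that P*A = L*U for a square matrix a.

    L (unit diagonal, implicit) is stored below the diagonal of lu,
    U on and above it. perm[i] is the original row now in position i.
    Illustrative sketch only; names are hypothetical.
    """
    n = len(a)
    lu = [[float(v) for v in row] for row in a]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: choose the largest-magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]              # multiplier, stored as L[i][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]      # update the trailing submatrix
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b using the factors returned by lu_factor."""
    n = len(lu)
    # Forward substitution: L y = P b
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    # Back substitution: U x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / lu[i][i]
    return x
```

A call such as `lu_solve(*lu_factor(A), b)` factors `A` once and then solves for one right-hand side. The nested loops in `lu_factor` perform roughly 2n³/3 floating-point operations, which dominates the cost for matrices of the size quoted in the abstract.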