A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices

Authors:
Antonio Roldao;George A. Constantinides
Affiliations:
Imperial College London;Imperial College London
Venue:
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Year:
2010

Abstract

Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such application is the acceleration of scientific computation, and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. In this article it is shown that through parallelization it is possible to reduce the computation time per iteration for an order n matrix from Θ(n²) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and to converge to a constant value as the matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation achieves a speedup of at least an order of magnitude.
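
For readers unfamiliar with the method, the following is a minimal software sketch of the textbook Conjugate Gradient iteration for a dense, symmetric positive definite system. It is not the authors' hardware design; the function names (`matvec`, `dot`, `conjugate_gradient`) and the parameters `tol` and `max_iter` are illustrative assumptions. It is included only to show where the Θ(n²) cost per iteration comes from: the dense matrix-vector product performs n dot products of length n, and it is this step that a parallel implementation can plausibly spread across n multiply-accumulate units to reach Θ(n) cycles.

```python
# Textbook Conjugate Gradient for a dense symmetric positive definite system.
# A software sketch only; names and defaults are assumptions, not the paper's design.

def dot(u, v):
    # Inner product of two length-n vectors: Theta(n) multiply-adds.
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, x):
    # Dense matrix-vector product: n dot products of length n,
    # i.e. Theta(n^2) multiply-adds when executed sequentially.
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    x = [0.0] * n          # initial guess x = 0
    r = list(b)            # residual r = b - A x, which equals b for x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)  # the dominant Theta(n^2) step of each iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))  # close to [1/11, 7/11]
```

Every other operation in the loop is a vector update or a dot product costing Θ(n), so the matrix-vector product dominates the sequential cost per iteration.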