As Field Programmable Gate Arrays (FPGAs) have reached capacities beyond millions of equivalent gates, it has become possible to accelerate floating-point scientific computing applications. One type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient algorithm. In this paper we present a parallel hardware Conjugate Gradient implementation. The implementation is particularly suited for accelerating multiple small- to medium-sized dense systems of linear equations. Through parallelization it is possible to reduce the computation time per iteration for an order-n matrix from Θ(n^2) cycles for a software implementation to Θ(n) cycles. I/O requirements are scalable and converge to a constant value as the matrix order increases. Results on a VirtexII-6000 demonstrate sustained performance of 5 GFLOPS, and projected results on a Virtex5-330 indicate sustained performance of 35 GFLOPS. The former result is comparable to high-end CPUs, whereas the latter represents a significant speedup.
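For readers unfamiliar with the method, the following is a minimal software sketch of the textbook Conjugate Gradient iteration for a dense symmetric positive-definite system. It is not the hardware design described in the abstract; the function names (`dot`, `matvec`, `conjugate_gradient`) and the `tol` and `max_iter` parameters are illustrative assumptions. It shows where the quadratic per-iteration cost in software comes from: the matrix-vector product is n independent dot products of length n, which is the step a parallel implementation could plausibly evaluate concurrently.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product: n independent row dot products.

    This is the step that costs on the order of n^2 operations per
    iteration when executed sequentially.
    """
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite matrix A.

    Textbook iteration; tol and max_iter are illustrative choices,
    not values taken from the paper.
    """
    n = len(b)
    if max_iter is None:
        max_iter = 2 * n
    x = [0.0] * n
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                       # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                            # direction update factor
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))
```

Apart from the matrix-vector product, each iteration consists of two further dot products and three vector updates, all of length n, so the sequential cost is dominated by `matvec`.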