A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices

  • Authors:
  • Antonio Roldao;George A. Constantinides

  • Affiliations:
  • Imperial College London;Imperial College London

  • Venue:
  • ACM Transactions on Reconfigurable Technology and Systems (TRETS)
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as building block to solve higher-order systems. In this article it is shown that through parallelization it is possible to convert the computation time per iteration for an order n matrix from Θ(n2) clock cycles on a microprocessor to Θ(n) on a FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and convergent to a constant value with the increase of matrix order. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrate that this FPGA implementation represents a significant speedup of at least an order of magnitude.