Supercomputer companies such as Cray, Silicon Graphics, and SRC Computers now offer reconfigurable computer (RC) systems that combine general-purpose processors (GPPs) with field-programmable gate arrays (FPGAs). The FPGAs can be programmed to become, in effect, application-specific processors. These exciting supercomputers allow end users to create custom computing architectures aimed at the computationally intensive parts of each problem. This report describes a parameterized, parallelized, deeply pipelined, dual-FPGA, IEEE-754 64-bit floating-point design for accelerating the conjugate gradient (CG) iterative method on an FPGA-augmented RC. The FPGA-based elements are developed via a hybrid approach that uses a high-level language (HLL)-to-hardware description language (HDL) compiler in conjunction with custom-built, VHDL-based, floating-point components. A reference version of the design is implemented on a contemporary RC. Measured run-time performance data compare the FPGA-augmented CG to the software-only version and show that the FPGA-based version runs 1.3 times faster than the software version. Estimates show that the design can achieve a fourfold speedup on a next-generation RC.
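
For readers unfamiliar with the CG iterative method named above, the following is a minimal software-only sketch of the textbook algorithm for solving A x = b with a symmetric positive-definite matrix A. It is not the FPGA design described in this report; the function names (dot, matvec, conjugate_gradient), the dense list-of-rows matrix layout, and the tolerance and iteration defaults are all assumptions made for illustration. Python floats are typically IEEE-754 64-bit values, which matches the precision the design targets.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows (illustrative layout)."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive-definite A (hypothetical helper)."""
    x = [0.0] * len(b)      # initial guess x = 0
    r = list(b)             # residual r = b - A x, which equals b when x = 0
    p = list(r)             # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:   # stop once the residual norm is small
            break
        Ap = matvec(A, p)         # the matrix-vector product dominates the cost
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small usage example: a 2x2 symmetric positive-definite system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

Each iteration consists of one matrix-vector product, two inner products, and three vector updates. These are the computationally intensive, floating-point-heavy operations that an accelerator for CG would plausibly target.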