Recently, reconfigurable computing systems have been built that employ Field-Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors. These systems provide new opportunities for high-performance computing. In this paper, we investigate hybrid designs that effectively utilize both the FPGAs and the processors in reconfigurable computing systems. Based on a high-level computational model, we propose designs for floating-point matrix multiplication and block LU decomposition. In our designs, the workload of an application is partitioned between the FPGAs and the processors in a balanced way, so that the FPGAs and processors work cooperatively without data hazards or memory access conflicts. Experimental results on the Cray XD1 show that with one Xilinx XC2VP50 FPGA (a relatively small device available in the XD1) and a 2.2 GHz AMD processor, our designs achieve up to a 1.4X speedup over a processor-only design and up to a 2X speedup over an FPGA-only design. The performance of our designs scales with the number of nodes. Moreover, our designs achieve higher performance when improved floating-point units or larger devices are used.
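The balanced partitioning idea can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the rows of C = A × B are split between a CPU worker and an FPGA worker in proportion to their assumed throughputs, so both finish at roughly the same time. The throughput numbers and helper names are invented for the example.

```python
def split_rows(n_rows, cpu_gflops, fpga_gflops):
    """Assign the first k rows of C to the FPGA and the rest to the CPU,
    with each unit's share matching its fraction of total throughput.
    (Throughput values here are illustrative assumptions.)"""
    total = cpu_gflops + fpga_gflops
    k = round(n_rows * fpga_gflops / total)
    return k  # rows [0, k) -> FPGA, rows [k, n_rows) -> CPU

def matmul_rows(A, B, row_lo, row_hi):
    """Plain triple-loop multiply over a row slice of A, standing in for
    the kernel each compute unit would run on its disjoint row block.
    Disjoint row blocks mean no write conflicts between the two units."""
    m, p = len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(m)) for j in range(p)]
            for i in range(row_lo, row_hi)]

# Small demo: multiply A by the identity, split across the two "units".
n = 6
A = [[i + j for j in range(n)] for i in range(n)]
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

k = split_rows(n, cpu_gflops=4.4, fpga_gflops=2.2)  # illustrative ratio 2:1
C = matmul_rows(A, I, 0, k) + matmul_rows(A, I, k, n)
assert C == A  # each unit computed its block; concatenation is correct
```

Because each unit owns a contiguous, disjoint block of output rows, the two workers never write to the same memory, which mirrors the abstract's claim of cooperation without data hazards or memory access conflicts.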