Numerical linear algebra operations are key primitives in scientific computing, and their performance optimization has been extensively investigated. With rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (Field-Programmable Gate Arrays) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication, and matrix factorization. By identifying the parameters of each operation, we analyze the trade-offs and propose a high-performance design. In implementing the designs, the parameter values are determined according to hardware constraints such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources, and that their performance compares favorably with that of general-purpose processor-based designs. We also show that with faster floating-point units and larger devices, the performance of our designs increases accordingly.
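The kind of parameter trade-off described above can be illustrated with a simple back-of-the-envelope model: the chip area bounds how many floating-point processing elements (PEs) fit, while the memory bandwidth bounds how fast they can be fed. The sketch below is not from the paper; all parameter names and values are hypothetical placeholders chosen only to show the shape of the analysis.

```python
# Illustrative sketch (not the paper's model): choosing the number of
# floating-point PEs for an FPGA dot-product design under area and
# bandwidth constraints. All numeric values are hypothetical.

def max_processing_elements(available_slices, slices_per_pe):
    """Area constraint: how many multiply-add PEs fit on the chip."""
    return available_slices // slices_per_pe

def sustained_gflops(num_pes, clock_mhz, mem_bandwidth_gbs, bytes_per_flop):
    """Sustained performance is the lesser of the compute peak and what
    the memory bandwidth can feed (a simple roofline-style bound)."""
    compute_peak = num_pes * 2 * clock_mhz / 1e3   # mul + add per cycle per PE
    bandwidth_bound = mem_bandwidth_gbs / bytes_per_flop
    return min(compute_peak, bandwidth_bound)

# Hypothetical Virtex-II Pro-class budget.
pes = max_processing_elements(available_slices=30000, slices_per_pe=1500)
perf = sustained_gflops(pes, clock_mhz=200,
                        mem_bandwidth_gbs=6.4, bytes_per_flop=8)
```

With these placeholder numbers the design is bandwidth-bound rather than compute-bound, which is exactly the situation where streaming operations like the dot product gain little from adding more PEs; a larger device or faster floating-point units shifts the balance, as the abstract notes.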