Hardware accelerators are becoming increasingly important in heterogeneous systems for many applications, including those that employ matrix decompositions. In recent years, a class of tiled matrix decomposition algorithms has been proposed for out-of-memory computations and multi-core architectures, including GPU-based heterogeneous systems. On FPGAs, however, such scalable solutions for large matrices are rarely found. In this paper we use the latest tiled decomposition algorithms from high-performance linear algebra to organize off-chip memory access, and loop mapping onto multiple processing cores for on-chip computation, to perform scalable, high-performance QR and LU matrix decompositions on FPGAs.
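To make the idea of a tiled decomposition concrete, the following is a minimal sketch of a generic right-looking tiled LU factorization in plain Python. It is not the design described in the paper: it assumes no pivoting, a square matrix whose size is a multiple of the tile size, and a matrix (for example, a diagonally dominant one) for which LU without pivoting is stable. The names `tiled_lu` and `b` are hypothetical and chosen only for illustration.

```python
def tiled_lu(a, b):
    """Factor the n x n matrix `a` (list of lists) in place as L*U.

    Illustrative assumptions: no pivoting, and tile size `b` divides n.
    On return, the strict lower triangle of `a` holds L (unit diagonal
    implied) and the upper triangle holds U.
    """
    n = len(a)
    assert n % b == 0, "tile size must divide the matrix size"
    for k in range(0, n, b):
        ke = k + b
        # Step 1: unblocked LU of the diagonal tile A[k:ke, k:ke].
        for j in range(k, ke):
            for i in range(j + 1, ke):
                a[i][j] /= a[j][j]
                for c in range(j + 1, ke):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: row panel, solve L_kk * U_kj = A_kj (forward substitution).
        for c in range(ke, n):
            for i in range(k, ke):
                for p in range(k, i):
                    a[i][c] -= a[i][p] * a[p][c]
        # Step 3: column panel, solve L_ik * U_kk = A_ik.
        for r in range(ke, n):
            for j in range(k, ke):
                for p in range(k, j):
                    a[r][j] -= a[r][p] * a[p][j]
                a[r][j] /= a[j][j]
        # Step 4: trailing update, A_ij -= L_ik * U_kj for all remaining tiles.
        for r in range(ke, n):
            for c in range(ke, n):
                s = 0.0
                for p in range(k, ke):
                    s += a[r][p] * a[p][c]
                a[r][c] -= s
    return a


def multiply_lu(lu):
    """Rebuild L*U from the packed in-place result, for checking."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]
                out[i][j] += l_ip * lu[p][j]
    return out
```

In a sketch like this, each outer iteration touches only one diagonal tile, one row panel, one column panel, and the trailing tiles. That access pattern is what lets tiled algorithms stream tiles from a large external memory while keeping the arithmetic in small local buffers. Step 4 dominates the work and consists of independent tile updates, so it is the natural candidate for distribution across several processing cores.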