QR and LU decompositions are among the most important matrix decomposition algorithms. Many studies accelerate these algorithms on FPGAs or ASICs, but on a case-by-case basis. In this paper, we propose a unified framework for matrix decomposition that combines three QR decomposition algorithms and the LU algorithm with pivoting into a single linear array structure. The QR and LU decomposition algorithms exhibit the same two-level loop structure and the same data dependencies. Exploiting these similarities, we derive a single fine-grained algorithm that covers all four matrix decomposition algorithms. Furthermore, we present a unified co-processor structure built on a scalable linear array of processing elements (PEs). The four PE types share the same memory-channel structure and PE interconnections and differ only in the internal structure of their data paths. Our unified co-processor, which uses IEEE 32-bit floating-point arithmetic, is implemented and mapped onto a Xilinx Virtex-5 FPGA chip. Experimental results show that our co-processors achieve speedups of 2.3x to 14.9x over a Pentium Dual CPU running two SSE threads.
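The shared two-level loop structure mentioned above can be illustrated in software. The following is a minimal Python sketch, not the paper's fine-grained algorithm or its hardware mapping. It assumes a column-oriented formulation in which step k derives a transform from column k and then applies it independently to every later column. Modified Gram-Schmidt stands in for one QR variant (the abstract does not name the three QR algorithms), and all function names (`decompose`, `generate`, `update`, `mgs_qr`, `lu_partial_pivoting`) are hypothetical.

```python
import math

def decompose(cols, generate, update):
    """Shared skeleton: step k builds a transform from column k,
    then applies it to each later column j > k."""
    n = len(cols)
    for k in range(n):                  # outer loop: one step per column
        t = generate(cols[k], k)        # depends only on column k
        for j in range(k + 1, n):       # inner loop: independent column updates
            update(cols[j], t, k, j)
    return cols

def to_cols(A):
    """Convert a row-major matrix into a list of float columns."""
    return [[float(A[i][j]) for i in range(len(A))] for j in range(len(A[0]))]

def mgs_qr(A):
    """Modified Gram-Schmidt QR of a full-column-rank matrix; returns (Q, R)."""
    cols = to_cols(A)
    n = len(cols)
    R = [[0.0] * n for _ in range(n)]

    def generate(col, k):
        R[k][k] = math.sqrt(sum(x * x for x in col))
        col[:] = [x / R[k][k] for x in col]     # column k becomes q_k
        return col

    def update(col, q, k, j):
        R[k][j] = sum(a * b for a, b in zip(q, col))
        col[:] = [a - R[k][j] * b for a, b in zip(col, q)]

    decompose(cols, generate, update)
    Q = [[cols[j][i] for j in range(n)] for i in range(len(A))]
    return Q, R

def lu_partial_pivoting(A):
    """LU with partial pivoting of a square nonsingular matrix.
    Returns (perm, L, U) such that row i of L*U equals row perm[i] of A."""
    cols = to_cols(A)
    n = len(cols)
    piv = [0] * n

    def generate(col, k):
        p = max(range(k, n), key=lambda i: abs(col[i]))  # pivot row
        piv[k] = p
        col[k], col[p] = col[p], col[k]
        for i in range(k + 1, n):
            col[i] /= col[k]            # multipliers stored in place
        return col

    def update(col, l, k, j):
        p = piv[k]
        col[k], col[p] = col[p], col[k]  # same row swap as the pivot column
        for i in range(k + 1, n):
            col[i] -= l[i] * col[k]

    decompose(cols, generate, update)
    # Later row swaps must also reach the multipliers stored in earlier
    # columns; this sketch applies them afterwards for simplicity.
    perm = list(range(n))
    for k in range(n):
        p = piv[k]
        perm[k], perm[p] = perm[p], perm[k]
        for j in range(k):
            cols[j][k], cols[j][p] = cols[j][p], cols[j][k]
    L = [[1.0 if i == j else (cols[j][i] if i > j else 0.0)
          for j in range(n)] for i in range(n)]
    U = [[cols[j][i] if i <= j else 0.0 for j in range(n)] for i in range(n)]
    return perm, L, U
```

Calling `mgs_qr(A)` or `lu_partial_pivoting(A)` on a small square matrix given as a list of rows runs both factorizations through the same `decompose` loop; only the `generate` and `update` callbacks differ, which loosely mirrors the idea of PEs that differ only in their internal data path.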