A unified co-processor architecture for matrix decomposition

Authors:
Yong Dou;Jie Zhou;Gui-Ming Wu;Jing-Fei Jiang;Yuan-Wu Lei;Shi-Ce Ni
Affiliations:
National Laboratory for Parallel & Distributed Processing, National University of Defense Technology, Changsha, China (all authors)
Venue:
Journal of Computer Science and Technology
Year:
2010

Abstract

QR and LU decompositions are among the most important matrix decomposition algorithms. Many studies accelerate these algorithms on FPGAs or ASICs, but only on a case-by-case basis. In this paper, we propose a unified framework for matrix decomposition that combines three QR decomposition algorithms and the LU algorithm with pivoting into a single linear array structure. The QR and LU decomposition algorithms exhibit the same two-level loop structure and the same data dependencies. Exploiting these similarities, we derive a unified fine-grained algorithm for all four matrix decomposition algorithms. Furthermore, we present a unified co-processor structure with a scalable linear array of processing elements (PEs). The four types of PEs share the same memory channel structure and PE interconnections and differ only in the internal structure of their data paths. Our unified co-processor, which uses IEEE 32-bit floating-point arithmetic, is implemented and mapped onto a Xilinx Virtex5 FPGA chip. Experimental results show that our co-processors achieve speedups of 2.3x to 14.9x over a Pentium Dual CPU running two SSE threads.
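The shared two-level loop structure mentioned in the abstract can be illustrated with a small software sketch. This is not the paper's fine-grained algorithm or its hardware design. It is a minimal pure-Python illustration, under the assumption that each step k first processes column k and then updates every trailing column j > k independently. The names `sweep`, `lu_partial_pivoting`, and `mgs_qr` are hypothetical. Modified Gram-Schmidt is used here only as one illustrative QR variant, since the abstract does not name the three QR algorithms. Matrices are stored as lists of columns and are assumed to be nonsingular.

```python
import math


def sweep(n, head, update):
    """Two-level loop assumed to be common to the decompositions below."""
    for k in range(n):                 # outer loop: one step per column
        info = head(k)                 # process column k (pivot / normalize)
        for j in range(k + 1, n):      # inner loop: trailing columns,
            update(k, j, info)         # each depends only on column k


def lu_partial_pivoting(a_cols):
    """LU with partial pivoting; returns (packed L\\U columns, row permutation)."""
    cols = [c[:] for c in a_cols]      # work on a copy, column-major
    m = len(cols[0])
    perm = list(range(m))              # perm[i] = original row now at row i

    def head(k):
        col = cols[k]
        p = max(range(k, m), key=lambda i: abs(col[i]))  # pivot row
        perm[k], perm[p] = perm[p], perm[k]
        for c in cols[:k + 1]:         # swap rows k and p in columns 0..k
            c[k], c[p] = c[p], c[k]
        for i in range(k + 1, m):
            col[i] /= col[k]           # multipliers: column k of L
        return p

    def update(k, j, p):
        c = cols[j]
        c[k], c[p] = c[p], c[k]        # apply the same row swap to column j
        for i in range(k + 1, m):
            c[i] -= cols[k][i] * c[k]  # eliminate using column k multipliers

    sweep(len(cols), head, update)
    return cols, perm


def mgs_qr(a_cols):
    """Modified Gram-Schmidt QR; returns (Q columns, R as a list of rows)."""
    q = [c[:] for c in a_cols]
    n = len(q)
    r = [[0.0] * n for _ in range(n)]

    def head(k):
        r[k][k] = math.sqrt(sum(x * x for x in q[k]))
        q[k] = [x / r[k][k] for x in q[k]]       # normalize column k
        return None

    def update(k, j, _):
        r[k][j] = sum(x * y for x, y in zip(q[k], q[j]))
        q[j] = [y - r[k][j] * x for x, y in zip(q[k], q[j])]  # remove q_k part

    sweep(n, head, update)
    return q, r


if __name__ == "__main__":
    a = [[2.0, 4.0, 8.0], [1.0, 3.0, 7.0], [1.0, 3.0, 9.0]]  # three columns
    print(lu_partial_pivoting(a))
    print(mgs_qr([[3.0, 4.0], [1.0, 2.0]]))
```

In this sketch both decompositions reuse the same `sweep` driver and differ only in the `head` and `update` bodies. That loosely mirrors the abstract's description of PEs that share memory channels and interconnections but differ in their internal data paths, although the mapping onto a linear PE array is not modeled here.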