We have implemented a two-dimensional systolic array for QR decomposition on a Xilinx Virtex5 FPGA using the Givens rotation algorithm. QR decomposition is a key step in many DSP applications, including sonar beamforming, channel equalization, and 3G wireless communication. Whereas previous work implements Givens rotations on a one-dimensional systolic array, our implementation uses a truly two-dimensional systolic array architecture; as a result, latency scales well for larger matrices. In addition, prior work avoids the divide and square root operations in the Givens rotation algorithm by using special operations such as CORDIC or special number systems such as the logarithmic number system (LNS). In contrast, our design uses straightforward floating-point divide and square root implementations, which makes it easier to use within a larger system. The input matrix size can be configured at compile time to many different sizes, making the design easily scalable to future, larger FPGAs or across multiple FPGAs. The QR module is fully pipelined and runs at a clock rate of over 130 MHz for the IEEE single-precision floating-point format. The peak performance for a 12 × 12 input matrix is approximately 35 GFLOPS.
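For readers unfamiliar with the algorithm, the following is a minimal sequential software sketch of QR decomposition by Givens rotations. It is not the hardware design described above: the function names `givens` and `qr_givens` are hypothetical, the arithmetic is Python double precision rather than IEEE single precision, and no systolic scheduling is modeled. It only shows where the square root and divide operations arise in each rotation.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0          # nothing to eliminate
    r = math.hypot(a, b)         # square root step: r = sqrt(a^2 + b^2)
    return a / r, b / r          # divide step


def rotate_rows(M, i, c, s):
    """Apply the rotation to rows i-1 and i of M, in place."""
    for k in range(len(M[0])):
        u, v = M[i - 1][k], M[i][k]
        M[i - 1][k] = c * u + s * v
        M[i][k] = -s * u + c * v


def qr_givens(A):
    """Factor an m x n matrix (list of lists) as A = Q R using Givens rotations."""
    m, n = len(A), len(A[0])
    R = [list(map(float, row)) for row in A]
    # Qt accumulates the product of all rotations, which equals Q transposed.
    Qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j from the bottom row up to just below the diagonal.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            rotate_rows(R, i, c, s)
            rotate_rows(Qt, i, c, s)
    Q = [[Qt[j][i] for j in range(m)] for i in range(m)]
    return Q, R
```

Calling `qr_givens` on a small matrix returns an orthogonal `Q` and an upper triangular `R` whose product reproduces the input up to rounding error. Each call to `givens` costs one square root and two divides, which are the operations that CORDIC-based or LNS-based designs are built to avoid.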