Scalable matrix decompositions with multiple cores on FPGAs

Authors:
Yi-Gang Tai; Chia-Tien Dan Lo; Kleanthis Psarris
Affiliations:
Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX 78249, USA; Department of Computer Science and Software Engineering, Southern Polytechnic State University, Marietta, GA 30060, USA; School of Natural and Behavioral Science, City University of New York - Brooklyn College, Brooklyn, NY 11210, USA
Venue:
Microprocessors & Microsystems
Year:
2013

Abstract

Hardware accelerators are becoming increasingly important in heterogeneous systems for many applications, including those that rely on matrix decompositions. In recent years, a class of tiled matrix decomposition algorithms has been proposed for out-of-memory computations and for multi-core architectures, including GPU-based heterogeneous systems. On FPGAs, however, such scalable solutions for large matrices are rarely found. In this paper we combine the latest tiled decomposition algorithms from high-performance linear algebra, which organize off-chip memory access, with loop mapping onto multiple processing cores for on-chip computation, to perform scalable, high-performance QR and LU matrix decompositions on FPGAs.
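
To make the idea of a tiled decomposition concrete, the following is a minimal software sketch of a right-looking tiled LU factorization in plain Python. It is not the authors' FPGA design: it omits pivoting (so it assumes a matrix such as a diagonally dominant one, where no row exchanges are needed), and the names `lu_tile`, `tiled_lu`, and `reconstruct` are hypothetical. It only illustrates how the work splits into a diagonal-tile factorization, two panel solves, and a trailing update.

```python
def lu_tile(a, k0, k1):
    """Factor the diagonal tile a[k0:k1][k0:k1] in place, without pivoting."""
    for k in range(k0, k1):
        for i in range(k + 1, k1):
            a[i][k] /= a[k][k]
            for j in range(k + 1, k1):
                a[i][j] -= a[i][k] * a[k][j]


def tiled_lu(a, b):
    """In-place tiled LU of a square list-of-lists matrix with tile size b.

    Afterwards the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U. No pivoting is performed (assumption).
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # Step 1: factor the diagonal tile.
        lu_tile(a, k0, k1)
        # Step 2: row panel, forward substitution with the unit-lower tile.
        for j in range(k1, n):
            for k in range(k0, k1):
                for i in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 3: column panel, solve against the upper-triangular tile.
        for i in range(k1, n):
            for k in range(k0, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 4: trailing update A_ij -= L_ik * U_kj; each (i, j) entry is
        # independent, which is the part that parallelizes most readily.
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += a[i][k] * a[k][j]
                a[i][j] -= s


def reconstruct(a):
    """Multiply the packed L and U factors back together (for checking)."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else a[i][k]
                s += l_ik * a[k][j]
            out[i][j] = s
    return out
```

In this sketch only the tile rows and columns touched at step `k0` need to be resident at once, which is the property that tiled algorithms exploit when the full matrix lives in slower memory. A quick check is to factor a copy of a diagonally dominant matrix with `tiled_lu(a, 2)` and compare `reconstruct(a)` against the original.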