Optimizing memory bandwidth use and performance for matrix-vector multiplication in iterative methods

  • Authors:
  • David Boland; George A. Constantinides

  • Affiliations:
  • Imperial College London, UK; Imperial College London, UK

  • Venue:
  • ACM Transactions on Reconfigurable Technology and Systems (TRETS)
  • Year:
  • 2011

Abstract

Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms for solving these systems, iterative methods, has attracted particular attention, with recent literature reporting large performance improvements over General-Purpose Processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelizing the matrix-vector multiplication, an operation that occurs in many applications and has therefore also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows on-chip memory buffers to be used to increase the available bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally either solved large matrices with only limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005] or achieved high performance only for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This article proposes hardware designs that exploit symmetrical and banded matrix structure, together with methods to optimize RAM use, in order both to increase performance and to retain it for larger-order matrices.
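
As a concrete illustration of the structure exploitation the abstract refers to, the following is a minimal software sketch in C, not the authors' hardware design: for a symmetric banded matrix, storing only the diagonal and the K superdiagonals roughly halves the storage and memory traffic per matrix-vector product compared to keeping the full band. The names N, K, band, and symm_band_mv are hypothetical, chosen for this sketch.

    #include <stdio.h>

    /* Illustrative sketch only: y = A*x for a symmetric banded matrix A of
     * order N with half-bandwidth K. Only the diagonal and the K
     * superdiagonals are stored, with band[i][d] = A[i][i+d]; symmetry
     * supplies the subdiagonal entries, halving the data that must be read
     * per product. N, K, band, and symm_band_mv are hypothetical names. */
    #define N 8   /* matrix order (hypothetical example size) */
    #define K 2   /* half-bandwidth (hypothetical example size) */

    void symm_band_mv(const double band[N][K + 1], const double x[N],
                      double y[N]) {
        for (int i = 0; i < N; i++)
            y[i] = 0.0;
        for (int i = 0; i < N; i++) {
            y[i] += band[i][0] * x[i];              /* diagonal term */
            for (int d = 1; d <= K && i + d < N; d++) {
                y[i]     += band[i][d] * x[i + d];  /* A[i][i+d] * x[i+d] */
                y[i + d] += band[i][d] * x[i];      /* A[i+d][i] * x[i], by symmetry */
            }
        }
    }

    int main(void) {
        /* Tridiagonal test matrix: 2 on the diagonal, -1 on the first
         * superdiagonal (and, by symmetry, the first subdiagonal). */
        double band[N][K + 1] = {{0}};
        double x[N], y[N];
        for (int i = 0; i < N; i++) {
            band[i][0] = 2.0;
            if (i + 1 < N) band[i][1] = -1.0;
            x[i] = 1.0;
        }
        symm_band_mv(band, x, y);
        for (int i = 0; i < N; i++)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }

In an iterative solver, a product of this form is evaluated once per iteration with the matrix held on chip, which is why reducing the stored footprint of each matrix entry translates directly into either larger solvable problems or more parallel multiply-accumulate units for the same RAM budget.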