High Performance Linear Algebra Operations on Reconfigurable Systems
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of these operations have been investigated extensively, and the basic operations have been implemented as software libraries. With rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (Field-Programmable Gate Arrays) has become feasible. In this paper, we propose FPGA-based designs for several BLAS operations, including vector product, matrix-vector multiply, and matrix multiply. By identifying the design parameters for each BLAS operation, we analyze the design tradeoffs. In our implementations, the values of the design parameters are determined by the hardware constraints, such as the available area, the size of on-chip memory, the external memory bandwidth, and the number of I/O pins. The proposed designs are implemented on a Xilinx Virtex-II Pro FPGA.
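The abstract's central tradeoff — a design parameter (such as block size) bounded by on-chip memory and external bandwidth — can be illustrated in software even though the paper's designs are hardware. The sketch below is a hypothetical software analogue, not the paper's actual architecture: a blocked matrix multiply in which the block size `BS` stands in for the amount of on-chip storage available to the compute array, controlling how much data is reused before the next fetch from external memory.

```c
#include <stddef.h>

/* Illustrative only: a blocked matrix multiply C = A * B for n x n
 * row-major matrices. BS models a design parameter: on an FPGA it
 * would be bounded by on-chip (BlockRAM) capacity; a larger BS means
 * more data reuse per external-memory access, at the cost of more
 * on-chip storage. */
#define BS 4  /* block size: analogous to on-chip storage per pass */

void matmul_blocked(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0f;
    /* Outer loops walk over BS x BS tiles; each tile-pair product is
     * the unit of work a parallel hardware array would evaluate. */
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t j = jj; j < jj + BS && j < n; j++)
                        for (size_t k = kk; k < kk + BS && k < n; k++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```

In the hardware setting described by the abstract, the analogous parameters (number of processing elements, words of on-chip memory per element) would be chosen to saturate either the available area or the external memory bandwidth, whichever binds first.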