High Performance Linear Algebra Operations on Reconfigurable Systems

Authors:
Ling Zhuo; Viktor K. Prasanna
Affiliations:
University of Southern California; University of Southern California
Venue:
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Year:
2005

Abstract

Field-Programmable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high-performance reconfigurable systems that employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for state-of-the-art reconfigurable systems. We study three data-intensive operations: dot product, matrix-vector multiply, and dense matrix multiply. The first two operations are I/O bound, and our designs efficiently utilize the memory bandwidth available in these systems. Because these operations require accumulation of sequentially delivered floating-point values, we develop a high-performance reduction circuit that uses only one floating-point adder and buffers of moderate size. For the matrix multiply operation, we propose a design that employs a linear array of FPGAs. This design exploits the memory hierarchy of the reconfigurable systems and has very low memory bandwidth requirements. To illustrate our ideas, we have implemented our designs for Level 2 and Level 3 BLAS on the Cray XD1.
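
The accumulation problem mentioned in the abstract can be illustrated in software. A pipelined floating-point adder takes several cycles to produce a result, so a running sum cannot consume a new input every cycle: the previous partial sum is still in flight. The sketch below is not the reduction circuit proposed in the paper. It shows a common and simpler workaround, assumed here for illustration only, in which the stream is interleaved across as many independent partial sums as the adder has pipeline stages and the partial sums are combined at the end. The names `interleaved_sum`, `dot`, and `latency` are hypothetical.

```python
from typing import Iterable, Sequence


def interleaved_sum(values: Iterable[float], latency: int = 4) -> float:
    """Sum a stream using `latency` independent partial sums.

    Slot i % latency is updated only once every `latency` inputs, so a
    pipelined adder with that many stages (an assumed model) would always
    find the previous result for a slot ready when its next input arrives.
    """
    if latency < 1:
        raise ValueError("latency must be at least 1")
    partial = [0.0] * latency
    for i, v in enumerate(values):
        partial[i % latency] += v
    # Final combine of the partial sums; a hardware design would also have
    # to schedule these additions on the adder.
    total = 0.0
    for p in partial:
        total += p
    return total


def dot(x: Sequence[float], y: Sequence[float], latency: int = 4) -> float:
    """Dot product whose products are accumulated by interleaved_sum."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return interleaved_sum((a * b for a, b in zip(x, y)), latency)


if __name__ == "__main__":
    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Note that interleaving changes the order of the floating-point additions, so for general inputs the result may differ in the last bits from a strictly sequential sum.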