64-bit floating-point FPGA matrix multiplication

Authors:
Yong Dou;S. Vassiliadis;G. K. Kuzmanov;G. N. Gaydadjiev
Affiliations:
National Laboratory for Parallel and Distributed Processing, Changsha, P.R. China;EEMCS, TU Delft, Delft, The Netherlands;EEMCS, TU Delft, Delft, The Netherlands;EEMCS, TU Delft, Delft, The Netherlands
Venue:
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Year:
2005

Abstract

We introduce a 64-bit ANSI/IEEE Std 754-1985 floating-point hardware matrix multiplier optimized for FPGA implementation. We propose a general block matrix multiplication algorithm applicable to arbitrary matrix sizes. The algorithm potentially enables optimum performance by exploiting the data locality and reuse inherent in general matrix multiplication while taking into account the limits on I/O bandwidth and local storage capacity. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in Xilinx Virtex-II Pro technology. Synthesis results confirm a superior performance-to-area ratio compared with recent related work. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching, for example, 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s of external memory bandwidth.
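
To make the idea of block matrix multiplication concrete, the following is a minimal software sketch of a generic blocked scheme. It is not the authors' hardware algorithm or PE schedule, which the abstract does not spell out. The function names `blocked_matmul` and `flops_per_word`, the `block` parameter, and the traffic model (each block of C stays in local storage while the matching rows of A and columns of B are streamed in once, with write-back of C ignored) are all assumptions made for illustration. Edge blocks are clipped, so the matrix dimensions need not be multiples of the block size.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (m x k) by b (k x n), one block of C at a time.

    Illustrative only: a generic blocked scheme, not the paper's design.
    Matrices are non-empty lists of rows of floats.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        i1 = min(i0 + block, m)          # clip the last row block
        for j0 in range(0, n, block):
            j1 = min(j0 + block, n)      # clip the last column block
            # The block c[i0:i1][j0:j1] plays the role of data held in
            # local storage; A and B elements are streamed past it.
            for p in range(k):
                for i in range(i0, i1):
                    a_ip = a[i][p]       # reused across the whole block row
                    for j in range(j0, j1):
                        c[i][j] += a_ip * b[p][j]
    return c


def flops_per_word(m, k, n, block):
    """Floating-point operations per input word loaded, under the
    simplified (assumed) traffic model described in the text above."""
    flops = 0
    words = 0
    for i0 in range(0, m, block):
        rows = min(block, m - i0)
        for j0 in range(0, n, block):
            cols = min(block, n - j0)
            flops += 2 * rows * cols * k   # one multiply and one add per term
            words += (rows + cols) * k     # rows of A plus columns of B
    return flops / words
```

Under this simplified model, a square block of side `block` yields `block` floating-point operations per loaded word, which illustrates why a larger local store can compensate for limited external bandwidth.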