Floating-point sparse matrix-vector multiply for FPGAs

Authors:
Michael deLorimier;André DeHon
Affiliations:
California Institute of Technology, Pasadena, CA;California Institute of Technology, Pasadena, CA
Venue:
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Year:
2005


Abstract

Large, high-density FPGAs with high local distributed memory bandwidth surpass the peak floating-point performance of high-end, general-purpose processors. Microprocessors do not come close to their peak floating-point performance on efficient algorithms that use the Sparse Matrix-Vector Multiply (SMVM) kernel. In fact, it is not uncommon for microprocessors to yield only 10-20% of their peak floating-point performance when computing SMVM. We develop and analyze a scalable SMVM implementation on modern FPGAs and show that it can sustain high-throughput, near-peak floating-point performance. For benchmark matrices from the Matrix Market Suite, we project 1.5 double-precision Gflops/FPGA for a single Virtex II 6000-4 and 12 double-precision Gflops for 16 Virtex IIs (750 Mflops/FPGA).
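
For readers unfamiliar with the kernel, the following is a minimal software sketch of SMVM (y = A·x with A sparse). It assumes the common compressed sparse row (CSR) layout; the abstract does not say which storage format the FPGA design uses, and the function and variable names (csr_matvec, values, col_idx, row_ptr) are illustrative only. The sketch shows the computation itself, not the hardware implementation.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, listed row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i inside `values`
    (CSR is an assumption made for this sketch.)
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # One multiply and one add per stored nonzero in this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect read of x
        y.append(acc)
    return y


# Example matrix (hypothetical, not from the benchmark suite):
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_matvec(values, col_idx, row_ptr, x)
print(y)  # [5.0, 6.0, 19.0]
```

Each stored nonzero costs two floating-point operations, while the read of x goes through col_idx and follows the matrix's sparsity pattern rather than a regular stride. That irregular access is one commonly cited reason processors with cache-based memory systems tend to run this kernel well below peak.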