Numerical recipes in C (2nd ed.): the art of scientific computing
Numerical recipes in C (2nd ed.): the art of scientific computing
Improving the memory-system performance of sparse-matrix vector multiplication
IBM Journal of Research and Development
Improving performance of sparse matrix-vector multiplication
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Templates for the solution of algebraic eigenvalue problems: a practical guide
Templates for the solution of algebraic eigenvalue problems: a practical guide
On Sparse Matrix-Vector Multiplication with FPGA-Based System
FCCM '02 Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance
FCCM '04 Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
High Level Programming for FPGA Based Image and Video Processing Using Hardware Skeletons
FCCM '01 Proceedings of the the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Sparsity: Optimization Framework for Sparse Matrix Kernels
International Journal of High Performance Computing Applications
High-Performance FPGA-Based General Reduction Methods
FCCM '05 Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
High Performance Linear Algebra Operations on Reconfigurable Systems
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Embedded floating-point units in FPGAs
Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays
Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Architectures and APIs: assessing requirements for delivering FPGA performance to applications
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
IEEE Transactions on Parallel and Distributed Systems
High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs
IEEE Transactions on Parallel and Distributed Systems
Multivariate Gaussian Random Number Generation Targeting Reconfigurable Hardware
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Area-efficient arithmetic expression evaluation using deeply pipelined floating-point cores
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Architectural modifications to enhance the floating-point performance of FPGAs
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Computation reuse in domain-specific optimization of signal recognition
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Floating-point divider design for FPGAs
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
From Silicon to Science: The Long Road to Production Reconfigurable Supercomputing
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
An integrated reduction technique for a double precision accumulator
Proceedings of the Third International Workshop on High-Performance Reconfigurable Computing Technology and Applications
Sparse Matrix-Vector Multiplication on a Reconfigurable Supercomputer with Application
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
3-D brain MRI tissue classification on FPGAs
IEEE Transactions on Image Processing
Haptic rendering of deformable objects using a multiple FPGA parallel computing architecture
Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays
Fast, Efficient Floating-Point Adders and Multipliers for FPGAs
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Layout aware optimization of high speed fixed coefficient FIR filters for FPGAs
International Journal of Reconfigurable Computing
Domain-Specific Optimization of Signal Recognition Targeting FPGAs
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Optimising memory bandwidth use for matrix-vector multiplication in iterative methods
ARC'10 Proceedings of the 6th international conference on Reconfigurable Computing: architectures, Tools and Applications
Portable and scalable FPGA-based acceleration of a direct linear system solver
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Journal of Parallel and Distributed Computing
A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Compiled multithreaded data paths on FPGAs for dynamic workloads
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
A Multiple-FPGA parallel computing architecture for real-time simulation of soft-object deformation
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 0.00 |
Floating-point Sparse Matrix-Vector Multiplication (SpMXV) is a key computational kernel in scientific and engineering applications. The poor data locality of sparse matrices significantly reduces the performance of SpMXV on general-purpose processors, which rely heavily on the cache hierarchy to achieve high performance. The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of SpMXV. In this paper, we propose an FPGA-based design for SpMXV. Our design accepts sparse matrices in Compressed Row Storage format, and makes no assumptions about the sparsity structure of the input matrix. The design employs IEEE-754 format double-precision floating-point multipliers/adders, and performs multiple floating-point operations as well as I/O operations in parallel. The performance of our design for SpMXV is evaluated using various sparse matrices from the scientific computing community, with the Xilinx Virtex-II Pro XC2VP70 as the target device. The MFLOPS performance increases with the hardware resources on the device as well as the available memory bandwidth. For example, when the memory bandwidth is 8 GB/s, our design achieves over 350 MFLOPS for all the test matrices. It demonstrates significant speedup over general-purpose processors particularly for matrices with very irregular sparsity structure. Besides solving SpMXV problem, our design provides a parameterized and flexible tree-based design for floating-point applications on FPGAs.