IEEE Transactions on Computers
Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance
FCCM '04 Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Sparse Matrix-Vector multiplication on FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Design Tradeoffs for BLAS Operations on Reconfigurable Hardware
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
High Performance Linear Algebra Operations on Reconfigurable Systems
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs
IEEE Transactions on Parallel and Distributed Systems
FPGA-based, floating-point reduction operations
MATH'06 Proceedings of the 10th WSEAS International Conference on APPLIED MATHEMATICS
Journal of Parallel and Distributed Computing
An improved reduction algorithm with deeply pipelined operators
SMC'09 Proceedings of the 2009 IEEE international conference on Systems, Man and Cybernetics
VFloat: A Variable Precision Fixed- and Floating-Point Library for Reconfigurable Hardware
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Accelerating DTI tractography using FPGAs
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
An efficient FPGA matrix multiplier for linear system simulation
Proceedings of the 2013 Grand Challenges on Modeling and Simulation Conference
The use of pipelined floating-point arithmetic cores to create high-performance FPGA-based computational kernels has introduced a new class of problems that do not exist when using single-cycle arithmetic cores. In particular, the data hazards associated with pipelined floating-point reduction circuits can limit the scalability or severely reduce the performance of an otherwise high-performance computational kernel. The inability to efficiently execute the reduction in hardware, coupled with memory bandwidth issues, may even negate the performance gains derived from hardware acceleration of the kernel. In this paper we introduce a method for developing scalable floating-point reduction circuits that run in optimal time while requiring only Θ(lg(n)) space and a single pipelined floating-point unit. Using a Xilinx Virtex-II Pro as the target device, we implement reference instances of our reduction method and present the FPGA design statistics supporting our scalability claims.
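The abstract's Θ(lg(n)) space bound can be illustrated in software. The sketch below is not the authors' circuit; it is a hypothetical model of the standard binary-merge scheme for streaming reduction, in which at most one partial sum per "level" (size class 2^k) is buffered at any time, so a single adder and about lg(n) buffer slots suffice to reduce an n-element stream.

```python
def reduce_stream(values):
    """Log-space streaming reduction: a software model, not a hardware design.

    Partial sums are kept one per level, like digits of a binary counter.
    Each merge of two equal-sized partials models one firing of the single
    pipelined adder; at most ~lg(n) slots are ever occupied at once.
    """
    slots = {}           # level k -> partial sum of 2**k inputs awaiting a partner
    max_occupied = 0     # peak number of buffered partials (the space bound)
    for v in values:
        level = 0
        # Carry propagation: merging two level-k partials yields a level-(k+1) partial.
        while level in slots:
            v += slots.pop(level)
            level += 1
        slots[level] = v
        max_occupied = max(max_occupied, len(slots))
    # Drain: at most ~lg(n) leftover partials are combined into the final result.
    return sum(slots.values()), max_occupied
```

For n = 100 inputs the model never buffers more than a handful of partials (the peak equals the maximum popcount of the running input count), consistent with the logarithmic space claim; a real circuit must additionally schedule the merges around the adder's pipeline latency, which is the data-hazard problem the paper addresses.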