Introduction to Algorithms
HiPC '02 Proceedings of the 9th International Conference on High Performance Computing
Memory-Optimal Evaluation of Expression Trees Involving Large Objects
HiPC '99 Proceedings of the 6th International Conference on High Performance Computing
FPGAs vs. CPUs: trends in peak floating-point performance
FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance
FCCM '04 Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Sparse Matrix-Vector multiplication on FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
ASAP '06 Proceedings of the IEEE 17th International Conference on Application-specific Systems, Architectures and Processors
A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer
FCCM '06 Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Advanced Components in the Variable Precision Floating-Point Library
FCCM '06 Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
An integrated reduction technique for a double precision accumulator
Proceedings of the Third International Workshop on High-Performance Reconfigurable Computing Technology and Applications
An improved reduction algorithm with deeply pipelined operators
SMC'09 Proceedings of the 2009 IEEE international conference on Systems, Man and Cybernetics
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
A fused hybrid floating-point and fixed-point dot-product for FPGAs
ARC'10 Proceedings of the 6th international conference on Reconfigurable Computing: architectures, Tools and Applications
Optimising memory bandwidth use for matrix-vector multiplication in iterative methods
ARC'10 Proceedings of the 6th international conference on Reconfigurable Computing: architectures, Tools and Applications
A scalable approach for automated precision analysis
Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Journal of Parallel and Distributed Computing
A convolve-and-merge approach for exact computations on high-performance reconfigurable computers
International Journal of Reconfigurable Computing - Special issue on High-Performance Reconfigurable Computing
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
An efficient FPGA matrix multiplier for linear system simulation
Proceedings of the 2013 Grand Challenges on Modeling and Simulation Conference
Hi-index | 0.01 |
Field programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits, the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design tradeoffs between the number of adders, buffer size and latency, and propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.