Sparse Matrix-Vector multiplication on FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
High Performance Linear Algebra Operations on Reconfigurable Systems
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
An improved reduction algorithm with deeply pipelined operators
SMC'09 Proceedings of the 2009 IEEE international conference on Systems, Man and Cybernetics
Hi-index | 0.00 |
FPGA-based floating-point kernels must exploit algorithmic parallelism and use deeply pipelined cores to gain a performance advantage over general-purpose processors. Inability to hide the latency of lengthy pipelines cansignificantly reduce the performance or impose unrealistic buffer requirements. Designs requiring reduction operations such as accumulation are particularly susceptible. In this paper we introduce two high-performance FPGA-based methods for reducing multiple sets of sequentially delivered floating-point values in optimal time without stalling the pipeline.