High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs

Authors: Ling Zhuo, Gerald R. Morris, Viktor K. Prasanna
Venue: IEEE Transactions on Parallel and Distributed Systems
Year: 2007

Abstract

Field programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits, the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design tradeoffs between the number of adders, buffer size and latency, and propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
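
To make the data hazard concrete, here is a minimal software sketch of a strided accumulation. It is an illustration only, not the authors' circuit: the abstract names the striding method but does not describe it, so the interleaving scheme below is an assumption about its general idea. The function name `strided_accumulate` and the parameter `alpha` (the adder's pipeline latency in cycles) are hypothetical. The sketch shows that if input i is added to partial sum i mod alpha, an addition never needs a result that is still inside the pipeline.

```python
from collections import deque


def strided_accumulate(values, alpha):
    """Simulate a single adder with a pipeline latency of `alpha` cycles.

    One value arrives per cycle. Value i is added to partial sum i % alpha,
    so the operand it needs was produced by an addition issued alpha cycles
    earlier, which has just left the pipeline.
    """
    if alpha < 1:
        raise ValueError("alpha must be at least 1")

    partial = [0.0] * alpha   # alpha interleaved partial sums
    pipeline = deque()        # in-flight additions: (ready_cycle, slot, result)

    for cycle, v in enumerate(values):
        # Retire every addition whose latency has elapsed.
        while pipeline and pipeline[0][0] <= cycle:
            _, slot, result = pipeline.popleft()
            partial[slot] = result

        slot = cycle % alpha
        # Hazard check: no in-flight addition may target the same slot.
        assert all(s != slot for _, s, _ in pipeline), "data hazard"

        pipeline.append((cycle + alpha, slot, partial[slot] + v))

    # Drain the pipeline after the last input.
    for _, slot, result in pipeline:
        partial[slot] = result

    # In hardware, combining the alpha partial sums is itself a reduction
    # that needs adders, buffering, and time; here it is a plain sum.
    return sum(partial)


if __name__ == "__main__":
    print(strided_accumulate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], alpha=4))
```

The final combining step is where the tradeoffs named in the abstract (number of adders, buffer size, latency) appear; the sketch hides them behind `sum`, so it should not be read as a model of the proposed designs.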