A pipelined-loop-compatible architecture and algorithm to reduce variable-length sets of floating-point data on a reconfigurable computer

Authors:
Gerald R. Morris; Viktor K. Prasanna
Affiliations:
U.S. Army Engineer Research and Development Center, Major Shared Resource Center, 3909 Halls Ferry Road, Vicksburg, MS 39180, United States; University of Southern California, Department of Electrical Engineering, 3740 McClintock Avenue, Los Angeles, CA 90089, United States
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

Reconfigurable computers (RCs) combine general-purpose processors (GPPs) with field programmable gate arrays (FPGAs). The FPGAs are reconfigured at run time to become application-specific processors that collaborate with the GPPs to execute the application. High-level language (HLL) to hardware description language (HDL) compilers allow the FPGA-based kernels to be generated using HLL-based programming rather than HDL-based hardware design. Unfortunately, the loops needed for floating-point reduction operations often cannot be pipelined by these HLL-HDL compilers. This capability gap prevents the development of a number of important FPGA-based kernels. This article describes a novel architecture and algorithm that allow the use of an HLL-HDL environment to implement high-performance FPGA-based kernels that reduce multiple, variable-length sets of floating-point data. A sparse matrix iterative solver is used to demonstrate the effectiveness of the reduction kernel. The FPGA-augmented version running on a contemporary RC is up to 2.4 times faster than the software-only version of the same solver running on the GPP. Conservative estimates show the solver will run up to 6.3 times faster than software on a next-generation RC.
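
The pipelining problem described in the abstract comes from the loop-carried dependency in a running sum: each addition needs the result of the previous one, and a pipelined floating-point adder takes several cycles to produce it. The following is a minimal software sketch of one well-known workaround, interleaved partial sums. It is not the architecture or algorithm of the article, which the abstract does not detail. The adder latency `ADDER_LATENCY` and the function names `reduce_set` and `reduce_sets` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence

# Hypothetical pipeline depth of a floating-point adder, in cycles.
ADDER_LATENCY = 4


def reduce_set(values: Sequence[float], lanes: int = ADDER_LATENCY) -> float:
    """Sum one set using `lanes` independent partial sums.

    Element i is accumulated into partial[i % lanes], so consecutive
    additions never depend on each other; only additions that are
    `lanes` steps apart share an accumulator.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    # Final fold of the partial sums (a short, fixed-length reduction).
    total = 0.0
    for p in partial:
        total += p
    return total


def reduce_sets(data: Sequence[float], lengths: Sequence[int]) -> List[float]:
    """Reduce back-to-back, variable-length sets stored in one flat stream."""
    results = []
    start = 0
    for n in lengths:
        results.append(reduce_set(data[start:start + n]))
        start += n
    return results


if __name__ == "__main__":
    # Two sets of lengths 3 and 2, stored contiguously, such as the
    # per-row products of a sparse matrix-vector multiply.
    stream = [1.0, 2.0, 3.0, 0.5, 0.25]
    print(reduce_sets(stream, [3, 2]))
```

Because floating-point addition is not associative, reordering the additions in this way can change the rounding of the result compared with a strictly sequential sum. The sketch also leaves open how hardware would handle sets shorter than the adder latency or sets that arrive back to back, which is part of what makes the variable-length case harder than a single fixed-length reduction.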