Architecture and CAD for Deep-Submicron FPGAs
Architecture and CAD for Deep-Submicron FPGAs
A synthesizable datapath-oriented embedded FPGA fabric
Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays
Architectural modifications to enhance the floating-point performance of FPGAs
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Architectural enhancements in Stratix-III™ and Stratix-IV™
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Flexible multi-mode embedded floating-point unit for field programmable gate arrays
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Cholesky decomposition using fused datapath synthesis
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Field Programmable Compressor Trees: Acceleration of Multi-Input Addition on FPGAs
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
FPGA Floating Point Datapath Compiler
FCCM '09 Proceedings of the 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines
Floating-point FPGA: architecture and modeling
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Bridge floating-point fused multiply-add design
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Measuring the Gap Between FPGAs and ASICs
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
The VTR project: architecture and CAD for FPGAs from verilog to routing
Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Hi-index | 0.00 |
This paper introduces a methodology to optimize coarse-grained floating point units (FPUs) in a hybrid field-programmable gate array (FPGA), where the FPU consists of a number of interconnected floating point adders/subtracters (FAs), multipliers (FMs), and wordblocks (WBs). The wordblocks include registers and lookup tables (LUTs) which can implement fixed point operations efficiently. We employ common subgraph extraction to determine the best mix of blocks within an FPU and study the area, speed and utilization tradeoff over a set of floating point benchmark circuits. We then explore the system impact of FPU density and flexibility in terms of area, speed, and routing resources. Finally, we derive an optimized coarse-grained FPU by considering both architectural and system-level issues. This proposed methodology can be used to evaluate a variety of FPU architecture optimizations. The results for the selected FPU architecture optimization show that although high density FPUs are slower, they have the advantages of improved area, area-delay product, and throughput.