A faster distributed arithmetic architecture for FPGAs

Authors:
Radhika S. Grover;Weijia Shang;Qiang Li
Affiliations:
Santa Clara University, CA;Santa Clara University, CA;Santa Clara University, CA
Venue:
FPGA '02 Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-programmable gate arrays
Year:
2002

Abstract

Distributed arithmetic (DA) is an important technique for implementing digital signal processing (DSP) functions in FPGAs. However, traditional lookup-table (LUT) based DA architectures contain one or more carry-propagation chains in the critical path, which limits the maximum clock rate at which the entire design can run. In this paper, we describe a novel technique that reduces or eliminates the carry-propagation chain from the critical path of LUT-based DA architectures on FPGAs. In the proposed scheme, the individual bits of a word do not have to be processed as a unit. Instead, the current iteration can start as soon as the least significant bit (LSB) of the previous iteration is available, without waiting for the entire word from the previous iteration to be computed. This technique has great potential for speeding up DSP applications based on DA. Designs are described for serial and parallel DALUT and accumulator structures in which an n-bit carry chain, where n is the word length, is broken into smaller r-bit chains, 1 ≤ r ≤ n. A cost-performance analysis of the designs is presented. The analysis shows that the proposed designs have a lower cost-performance ratio (a lower ratio being better) than traditional DA designs. We also show that the 8-bit (r = 8) designs offer a good compromise between cost and performance. The designs are implemented on a Xilinx XC4028XL-3-BG256 device using Xilinx Foundation tools v3.1i. The results show that, in some cases, the proposed designs achieve a speedup of at least 1.5 over traditional DA designs.
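
For readers unfamiliar with DA, the following is a minimal software sketch of the conventional bit-serial, LUT-based scheme that the abstract takes as its baseline. It is not the architecture proposed in the paper. The function names (build_da_lut, da_inner_product) are hypothetical, the inputs are assumed to be unsigned n_bits-wide integers, and the accumulator shifts the LUT output left rather than shifting the accumulator right as hardware typically does; the two are arithmetically equivalent. Each pass through the loop performs one full-width addition, which in hardware corresponds to the carry-propagation chain the abstract describes.

```python
def build_da_lut(coeffs):
    """Precompute the DA lookup table.

    Entry `addr` holds the sum of coeffs[k] over every bit k set in `addr`.
    """
    lut = []
    for addr in range(1 << len(coeffs)):
        lut.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return lut


def da_inner_product(coeffs, xs, n_bits):
    """Compute sum(coeffs[k] * xs[k]) without multipliers.

    Assumes every xs[k] is an unsigned integer that fits in n_bits bits.
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(n_bits):  # one iteration per bit position, LSB first
        # Form the LUT address from bit b of every input word.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        # Full-width accumulate: this add is the carry chain in hardware.
        acc += lut[addr] << b
    return acc


if __name__ == "__main__":
    coeffs = [3, -5, 7]
    xs = [10, 3, 15]
    print(da_inner_product(coeffs, xs, n_bits=4))
    print(sum(c * x for c, x in zip(coeffs, xs)))  # direct reference result
```

The LUT has 2^K entries for K coefficients, so the sketch is only practical for small K, which is also why hardware designs often partition the table.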