IEEE Transactions on Computers
Applying an XC6200 to Real-Time Image Processing
IEEE Design & Test
Efficient implementation of the DCT on custom computers
FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
A novel FPGA logic block for improved arithmetic performance
Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
An FPGA Logic Cell and Carry Chain Configurable as a 6:2 or 7:2 Compressor
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
POWER-Area-Performance Characteristics of FPGA-based Sigma-Delta FIR Filters
Journal of Signal Processing Systems
Hi-index | 0.00 |
Distributed Arithmetic (DA) is an important technique to implement digital signal processing (DSP) functions in FPGAs. However, traditional lookup table (LUT) based DA architectures contain one or more carry propagation chains in the critical path that dictates the fastest time at which an entire design can run. In this paper, we describe a novel technique that can reduce or eliminate the carry-propagate chain from the critical path in LUT based DA architectures on FPGAs. In the proposed scheme, the individual bits of a word do not have to be processed as a unit. Instead, the current iteration can start as soon as the least significant bit (LSB) of the previous iteration is available, without waiting for the entire word from the previous iteration to be fully computed. This technique has great potential in speeding up DSP applications based on DA. Designs are described for serial and parallel DALUT and accumulator structures in which an n-bit carry chain, where n is the word length, is broken into smaller r-bit chains, 1*nnr n . A cost-performance analysis of the designs is presented. The analysis shows that the designs proposed in this paper have a lower cost-performance ratio (indicating better performance) than traditional DA designs. We also show that the 8-bit (r = 8) designs offer a good compromise between cost and performance. The implementation is on a Xilinx chip XC4028XL-3-BG256 using Xilinx Foundation tools v 3.1i. The results show that the proposed designs can achieve speedup by a factor of at least 1.5 over traditional DA designs in some cases.