We present the design of a VLSI processor which can be programmed to compute the discrete Fourier transform of a sequence of n points and which achieves the theoretical AT² lower bound of Ω(n²) for n ∈ N, where N is an infinite set. Furthermore, since the set N is also sufficiently dense, the processor achieves for any n the theoretical AT² lower bound of Ω(n²) for computing the cyclic convolution of two sequences of n points. Uniquely, our design achieves this bound without the use of data shuffling or long wires. Also, the processor uses only approximately √n multipliers, while many other designs need Ω(n) multipliers to achieve the same time bounds. Since multipliers are usually much larger than adders, the processor presented in this paper should be smaller. The design also features layout regularity, minimal control, and nearest-neighbor interconnect of arithmetic cells of a few different types. These characteristics make it an ideal candidate for VLSI implementation.
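The link between the discrete Fourier transform and cyclic convolution that the abstract relies on can be illustrated in software. The following is a minimal sketch, not the processor's architecture: it uses a naive O(n²) DFT, and all function names are hypothetical. It checks that multiplying two transforms pointwise and inverting the result gives the same answer as computing the cyclic convolution directly from its definition.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustrative only)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def inverse_dft(X):
    """Naive inverse DFT, including the 1/n normalization."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
        for j in range(n)
    ]


def cyclic_convolution_direct(a, b):
    """Cyclic convolution from the definition: c[i] = sum_j a[j] * b[(i - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def cyclic_convolution_via_dft(a, b):
    """Cyclic convolution via the convolution theorem.

    Assumes integer inputs, so the result is rounded back to integers
    to remove floating-point noise.
    """
    A, B = dft(a), dft(b)
    c = inverse_dft([p * q for p, q in zip(A, B)])
    return [round(v.real) for v in c]


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(cyclic_convolution_direct(a, b))   # [66, 68, 66, 60]
print(cyclic_convolution_via_dft(a, b))  # [66, 68, 66, 60]
```

Both routes agree on this 4-point input, which is why a processor that computes the DFT can also be used to compute cyclic convolutions.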