Complex two-dimensional FFTs up to 256 × 256 points are implemented on the Intel iPSC/System 286 hypercube, with emphasis on comparing the effects of data mapping, data transposition or communication needs, and the use of distributed FFTs. Two new implementations of the 2D-FFT are introduced: the Local-Distributed method, which performs local FFTs in one direction followed by distributed FFTs in the other, and a Vector-Radix implementation derived from decimating the DFT in two dimensions instead of one. In addition, the Transpose-Split method, involving local FFTs in both directions with an intervening matrix transposition, and the Block 2D-FFT, involving distributed FFT butterflies in both directions, are implemented and compared with the other two methods. Timing results show that on the Intel iPSC/System 286 there is hardly any difference between the methods, with the only differences arising from the efficiency or inefficiency of communication. Because the Intel cannot overlap communication and computation, the user is forced to buffer data; in some of the methods, this causes processor blocking during communication. Issues of vectorization, communication strategies, data storage, and buffering requirements are investigated, and a model is given that compares vectorization and communication complexity. While timing results show that the Transpose-Split method is in general slightly faster, the model shows that the Block and Vector-Radix methods have the potential to be faster if the communication difficulties were resolved. Therefore, if communication could be “hidden” within computation, the latter two methods could become useful, with the Block method vectorizing the best and the Vector-Radix method having 25% fewer multiplications than row-column 2D-FFT methods. Finally, the Local-Distributed method is a good hybrid requiring no transposition and can be useful in certain circumstances.
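The row-column structure underlying the Transpose-Split method can be illustrated with a minimal serial NumPy sketch. This is not the paper's iPSC implementation: the distributed matrix transposition, which on the hypercube is the all-to-all communication step, is replaced here by a local transpose.

```python
import numpy as np

def transpose_split_fft2(x):
    """Row-column 2D FFT via the Transpose-Split decomposition (serial sketch)."""
    # Stage 1: local 1-D FFTs along rows (each processor's resident direction).
    y = np.fft.fft(x, axis=1)
    # Matrix transposition: in a distributed setting this is the
    # communication phase; here it is an ordinary local transpose.
    y = y.T
    # Stage 2: 1-D FFTs along what were originally the columns.
    y = np.fft.fft(y, axis=1)
    # Transpose back so the result has the conventional orientation.
    return y.T

# The result agrees with a direct 2D FFT of the same array.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
assert np.allclose(transpose_split_fft2(a), np.fft.fft2(a))
```

The Local-Distributed method differs only in the second stage: instead of transposing, the column FFTs are carried out as distributed butterflies across processors, which is why it needs no transposition step.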
This paper provides some general guidelines for evaluating parallel distributed 2D-FFT implementations and concludes that, while different methods may be best suited to different systems, better implementation techniques as well as faster algorithms still perform better as communication becomes more efficient.
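The Vector-Radix idea, decimating the DFT in both dimensions at once rather than one at a time, can likewise be sketched in serial NumPy form. The following radix-2×2 decimation-in-time recursion is a hypothetical illustration of the general technique for square power-of-two sizes, not the paper's hypercube code; it is the sharing of the combined twiddle factor across both index directions that yields the reduced multiplication count cited above.

```python
import numpy as np

def vector_radix_fft2(x):
    """Radix-2x2 decimation-in-time 2D FFT for an n-by-n array, n a power of two."""
    n = x.shape[0]
    if n == 1:
        return x.astype(complex)
    # Split into the four polyphase sub-arrays and transform each recursively.
    ee = vector_radix_fft2(x[0::2, 0::2])  # even rows, even cols
    eo = vector_radix_fft2(x[0::2, 1::2])  # even rows, odd cols
    oe = vector_radix_fft2(x[1::2, 0::2])  # odd rows, even cols
    oo = vector_radix_fft2(x[1::2, 1::2])  # odd rows, odd cols
    w = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    t_eo = w[None, :] * eo          # twiddle W^{k2} on the column index
    t_oe = w[:, None] * oe          # twiddle W^{k1} on the row index
    t_oo = np.outer(w, w) * oo      # combined twiddle W^{k1+k2}
    out = np.empty((n, n), dtype=complex)
    # 2D butterfly: the four output quadrants differ only in sign patterns.
    out[:n//2, :n//2] = ee + t_eo + t_oe + t_oo
    out[:n//2, n//2:] = ee - t_eo + t_oe - t_oo
    out[n//2:, :n//2] = ee + t_eo - t_oe - t_oo
    out[n//2:, n//2:] = ee - t_eo - t_oe + t_oo
    return out

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
assert np.allclose(vector_radix_fft2(a), np.fft.fft2(a))
```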