Complex two-dimensional FFTs up to 256 × 256 points are implemented on the Intel iPSC/System 286 hypercube, with emphasis on comparing the effects of data mapping, data transposition or communication needs, and the use of distributed FFTs. Two new implementations of the 2D-FFT are introduced: the Local-Distributed method, which performs local FFTs in one direction followed by distributed FFTs in the other, and a Vector-Radix implementation derived from decimating the DFT in two dimensions instead of one. In addition, the Transpose-Split method, involving local FFTs in both directions with an intervening matrix transposition, and the Block 2D-FFT, involving distributed FFT butterflies in both directions, are implemented and compared with the other two methods. Timing results show that on the Intel iPSC/System 286 there is hardly any difference between the methods, with the only differences arising from the efficiency or inefficiency of communication. Because the Intel cannot overlap communication and computation, the user is forced to buffer data; in some of the methods, this causes processor blocking during communication. Issues of vectorization, communication strategies, data storage, and buffering requirements are investigated, and a model is given that compares vectorization and communication complexity. While timing results show that the Transpose-Split method is in general slightly faster, the model shows that the Block and Vector-Radix methods have the potential to be faster if the communication difficulties were resolved. Therefore, if communication could be “hidden” within computation, the latter two methods could become useful, with the Block method vectorizing the best and the Vector-Radix method having 25% fewer multiplications than row-column 2D-FFT methods. Finally, the Local-Distributed method is a good hybrid requiring no transposition and can be useful in certain circumstances.
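The row-column structure underlying the Transpose-Split method can be illustrated with a minimal serial NumPy sketch. This is not the paper's iPSC implementation: the distributed matrix transposition, which on the hypercube is the all-to-all communication step, is replaced here by a local transpose.

```python
import numpy as np

def transpose_split_fft2(x):
    """Row-column 2D FFT via the Transpose-Split decomposition (serial sketch)."""
    # Stage 1: local 1-D FFTs along rows (each processor's resident direction).
    y = np.fft.fft(x, axis=1)
    # Matrix transposition: in a distributed setting this is the
    # communication phase; here it is an ordinary local transpose.
    y = y.T
    # Stage 2: 1-D FFTs along what were originally the columns.
    y = np.fft.fft(y, axis=1)
    # Transpose back so the result has the conventional orientation.
    return y.T

# The result agrees with a direct 2D FFT of the same array.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
assert np.allclose(transpose_split_fft2(a), np.fft.fft2(a))
```

The Local-Distributed method differs only in the second stage: instead of transposing, the column FFTs are carried out as distributed butterflies across processors, which is why it needs no transposition step.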
This paper provides some general guidelines for evaluating parallel distributed 2D-FFT implementations and concludes that, while different methods may be best suited to different systems, better implementation techniques as well as faster algorithms still perform better as communication becomes more efficient.
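The Vector-Radix idea, decimating the DFT in both dimensions at once rather than one at a time, can likewise be sketched in serial NumPy form. The following radix-2×2 decimation-in-time recursion is a hypothetical illustration of the general technique for square power-of-two sizes, not the paper's hypercube code; it is the sharing of the combined twiddle factor across both index directions that yields the reduced multiplication count cited above.

```python
import numpy as np

def vector_radix_fft2(x):
    """Radix-2x2 decimation-in-time 2D FFT for an n-by-n array, n a power of two."""
    n = x.shape[0]
    if n == 1:
        return x.astype(complex)
    # Split into the four polyphase sub-arrays and transform each recursively.
    ee = vector_radix_fft2(x[0::2, 0::2])  # even rows, even cols
    eo = vector_radix_fft2(x[0::2, 1::2])  # even rows, odd cols
    oe = vector_radix_fft2(x[1::2, 0::2])  # odd rows, even cols
    oo = vector_radix_fft2(x[1::2, 1::2])  # odd rows, odd cols
    w = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    t_eo = w[None, :] * eo          # twiddle W^{k2} on the column index
    t_oe = w[:, None] * oe          # twiddle W^{k1} on the row index
    t_oo = np.outer(w, w) * oo      # combined twiddle W^{k1+k2}
    out = np.empty((n, n), dtype=complex)
    # 2D butterfly: the four output quadrants differ only in sign patterns.
    out[:n//2, :n//2] = ee + t_eo + t_oe + t_oo
    out[:n//2, n//2:] = ee - t_eo + t_oe - t_oo
    out[n//2:, :n//2] = ee + t_eo - t_oe - t_oo
    out[n//2:, n//2:] = ee - t_eo - t_oe + t_oo
    return out

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
assert np.allclose(vector_radix_fft2(a), np.fft.fft2(a))
```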