Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Active harmony: towards automated performance tuning
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Implementation and performance analysis of non-blocking collective operations for MPI
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Leveraging non-blocking collective communication in high-performance applications
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
IBM Journal of Research and Development
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Optimizing bandwidth limited problems using one-sided communication and overlap
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Auto-tuning of fast fourier transform on graphics processors
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Computer Science - Research and Development
Online Adaptive Code Generation and Tuning
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
4.45 Pflops astrophysical N-body simulation on K computer: the gravitational trillion-body problem
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
This paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code performs fully asynchronous communication without any support from special hardware. We also improve cache performance through loop tiling. To cope with the complex trade-off regarding our optimization techniques, we parameterize our code and auto-tune the parameters efficiently in a large parameter space. Experimental results from two systems confirm that our code achieves a speedup of up to 1.76x over the FFTW library.