All-to-all communication is a well-known performance bottleneck for many applications, such as those that use the fast Fourier transform (FFT) algorithm. We analyze the performance of all-to-all communication on the Blue Gene/L torus interconnect, which experiences link contention even for all-to-all operations with short messages. We observe that all-to-all performance depends on the shape of the processor partition, and we present a performance analysis of all-to-all on partitions of various shapes. We then present optimization schemes that substantially improve the performance of all-to-all with both short and large messages. In particular, throughput improved from 64% to over 99% of peak on the 65,536-node (64×32×32) Blue Gene/L machine at Lawrence Livermore National Laboratory. We show the impact of the all-to-all performance optimizations on 1-D and 3-D FFT benchmarks. With our optimized all-to-all, we achieved a performance of over 2.8 TF on the HPC Challenge 1-D FFT benchmark.
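To illustrate why partition shape can matter, the following is a minimal sketch of an idealized load model, not the analysis used in the paper. It assumes shortest-path routing on a torus with traffic spread perfectly evenly over the two directions of each dimension, and it counts how many messages cross each directional link when every node sends one message to every node. The function names `ring_hops_total` and `per_link_load` are hypothetical.

```python
def ring_hops_total(n):
    """Sum of shortest-path hop counts from one node to every node on an n-node ring."""
    return sum(min(k, n - k) for k in range(n))


def per_link_load(shape):
    """Average messages per directional link, per dimension, for one all-to-all.

    Idealized assumption: shortest-path routing with load balanced evenly
    across all links of a dimension (no adaptive-routing or contention effects).
    """
    p = 1
    for n in shape:
        p *= n  # total number of nodes in the partition

    loads = []
    for n in shape:
        # Each of the p sources has p // n destinations at every ring offset
        # along this dimension, so total hops = p * (p // n) * ring sum.
        total_hops = p * (p // n) * ring_hops_total(n)
        links = 2 * p  # one "+" and one "-" directional link per node
        loads.append(total_hops / links)
    return loads


if __name__ == "__main__":
    for shape in [(8, 8, 8), (16, 8, 4)]:
        loads = per_link_load(shape)
        print(shape, loads, "bottleneck:", max(loads))
```

Under these assumptions, both example partitions have 512 nodes, but the cubic 8×8×8 shape places 512 messages on every link, while the 16×8×4 shape places 1024 messages on the links of its longest dimension. In this model the longest dimension bounds the all-to-all time, which is one way a partition's shape can affect achievable throughput.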