Torus networks are commonly used in massively parallel computers, and their performance often limits total application performance. In an asymmetric torus network in particular, traffic along the longest axis is the bottleneck for all-to-all communication, so it is important to schedule the longest-axis traffic smoothly. In this paper, we propose a new algorithm, based on an indirect method, that pipelines the all-to-all procedure using shared-memory parallel threads. The algorithm (1) isolates the longest-axis traffic from other traffic, (2) schedules it smoothly, and (3) overlaps all other traffic and overhead of the all-to-all communication behind the longest-axis traffic. The proposed method achieves up to 95% of the theoretical peak. We integrated the overlapped all-to-all method with parallel FFT algorithms, so that local FFT calculations are also overlapped behind the longest-axis traffic. The FFT performance reaches up to 90% of the theoretical peak for the parallel 1D FFT.
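To illustrate why the longest axis dominates, the following is a minimal sketch of a link-load model and not the algorithm proposed in the paper. It assumes that every node sends one message to every node, that messages follow shortest paths around each ring, and that load is spread evenly over both directions of an axis. The function names and the 8x4x4 torus shape are hypothetical and chosen only for illustration.

```python
def ring_hops(length):
    """Sum of shortest-path hop counts from one node to every node of a ring."""
    return sum(min(d, length - d) for d in range(length))


def axis_link_loads(dims):
    """Average number of messages per directed link on each torus axis
    for a full all-to-all (assumed model: minimal routing, even spreading)."""
    nodes = 1
    for length in dims:
        nodes *= length

    loads = []
    for length in dims:
        # Every source has nodes/length destinations at each offset along this axis.
        total_hops = nodes * (nodes // length) * ring_hops(length)
        # Assumption: each node owns one "+" and one "-" link on this axis.
        directed_links = 2 * nodes
        loads.append(total_hops / directed_links)
    return loads


# Hypothetical asymmetric torus: the first axis is twice as long as the others.
loads = axis_link_loads((8, 4, 4))
print(loads)  # [128.0, 64.0, 64.0]
```

Under these assumptions the per-link load on an axis of even length L in a torus of P nodes is P·L/8, so it grows linearly with the axis length. In the 8x4x4 example the links of the long axis carry twice the traffic of the others, which is consistent with the abstract's point that the remaining traffic and overhead can be hidden behind the longest-axis traffic.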