Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers

Authors:
Jun Doi; Yasushi Negishi
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010


Abstract

Torus networks are commonly used in massively parallel computers, and their performance often constrains total application performance. In an asymmetric torus network in particular, traffic along the longest axis is the bottleneck for all-to-all communication, so it is important to schedule the longest-axis traffic smoothly. In this paper, we propose a new algorithm based on an indirect method that pipelines the all-to-all procedures using shared-memory parallel threads. The algorithm (1) isolates the longest-axis traffic from other traffic, (2) schedules it smoothly, and (3) overlaps all of the other traffic and overhead of the all-to-all communication behind the longest-axis traffic. The proposed method achieves up to 95% of the theoretical peak. We integrated the overlapped all-to-all method with parallel FFT algorithms, and local FFT calculations are also overlapped behind the longest-axis traffic. The FFT performance reaches up to 90% of the theoretical peak for the parallel 1D FFT.
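To make the idea of an indirect all-to-all concrete, the following is a minimal sketch in Python that simulates ranks on a 2D grid and delivers every message in two phases, each confined to a single axis. It is an illustration under stated assumptions, not the algorithm from the paper: the grid is 2D rather than a 3D torus, the phase ordering is chosen arbitrarily, and the pipelining, threading, and overlap described in the abstract are not modeled. The names `make_send`, `direct_all_to_all`, and `indirect_all_to_all` are hypothetical.

```python
from itertools import product


def make_send(nx, ny):
    """Build send[src][dst] with a distinct payload for every (src, dst) pair."""
    ranks = list(product(range(nx), range(ny)))
    return {src: {dst: f"{src}->{dst}" for dst in ranks} for src in ranks}


def direct_all_to_all(send, ranks):
    """Reference result: every rank receives one payload from every rank."""
    return {dst: {src: send[src][dst] for src in ranks} for dst in ranks}


def indirect_all_to_all(send, nx, ny):
    """Two-phase (indirect) all-to-all on an nx-by-ny grid of simulated ranks.

    Phase 1 moves data only along the x axis, phase 2 only along the y axis,
    so traffic on one axis is separated from traffic on the other.
    """
    ranks = list(product(range(nx), range(ny)))

    # Phase 1: rank (sx, sy) hands each message to the intermediate rank
    # (dx, sy), which shares the destination's x and the source's y.
    staged = {r: {} for r in ranks}
    for (sx, sy) in ranks:
        for (dx, dy) in ranks:
            staged[(dx, sy)][((sx, sy), (dx, dy))] = send[(sx, sy)][(dx, dy)]

    # Phase 2: each intermediate rank forwards along the y axis only.
    recv = {r: {} for r in ranks}
    for mid, bundle in staged.items():
        for (src, dst), payload in bundle.items():
            assert mid[0] == dst[0]  # only a y-axis hop remains
            recv[dst][src] = payload
    return recv


if __name__ == "__main__":
    nx, ny = 4, 2  # an asymmetric grid: x is the longer axis
    send = make_send(nx, ny)
    ranks = list(product(range(nx), range(ny)))
    assert indirect_all_to_all(send, nx, ny) == direct_all_to_all(send, ranks)
```

Because each phase touches only one axis, a real implementation can treat the long-axis phase as the critical path and schedule the remaining work around it, which is the property the abstract relies on.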