On the communication complexity of 3D FFTs and its implications for Exascale

Authors:
Kenneth Czechowski; Casey Battaglino; Chris McClanahan; Kartik Iyer; P.-K. Yeung; Richard Vuduc
Affiliations:
Georgia Institute of Technology, Atlanta, GA, USA (all authors)
Venue:
Proceedings of the 26th ACM international conference on Supercomputing
Year:
2012

Abstract

This paper revisits the communication complexity of large-scale 3D fast Fourier transforms (FFTs) and asks what impact trends in current architectures will have on FFT performance at exascale. We analyze both memory hierarchy traffic and network communication to derive suitable analytical models, which we calibrate against current software implementations; we then evaluate these models to predict potential scaling outcomes at exascale by extrapolating current technology trends. Of particular interest is the performance impact of choosing high-density processors, typified today by graphics co-processors (GPUs), as the base processor for an exascale system. Among various observations, a key prediction is that although inter-node all-to-all communication is expected to be the bottleneck of distributed FFTs, intra-node communication, expressed precisely in terms of the relative balance among compute capacity, memory bandwidth, and network bandwidth, will play a critical role.
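
To make the idea of balance among compute capacity, memory bandwidth, and network bandwidth concrete, here is a minimal sketch of a back-of-the-envelope estimate. It is not the paper's model. The function names, the parameters, the pencil-style decomposition with two all-to-all transposes, and the count of three memory sweeps are all illustrative assumptions. Only the roughly 5 N log2 N operation count for an N-point complex FFT is a standard figure.

```python
import math


def fft3d_time_estimate(n, p, flops_per_node, mem_bw, net_bw,
                        word_bytes=16, sweeps=3, transposes=2):
    """Rough per-phase times (seconds) for an n x n x n complex FFT on p nodes.

    Hypothetical illustration only: flops_per_node is in flop/s, mem_bw and
    net_bw are per-node bandwidths in bytes/s, and word_bytes=16 assumes
    double-precision complex values.
    """
    total_points = n ** 3
    # Standard operation count for an N-point complex FFT: about 5 N log2 N.
    flops = 5.0 * total_points * math.log2(total_points)
    t_compute = flops / (p * flops_per_node)

    # Assumption: each 1D sweep reads and writes every local element once.
    local_bytes = total_points * word_bytes / p
    t_memory = sweeps * 2 * local_bytes / mem_bw

    # Assumption: each all-to-all transpose sends nearly all local data off node.
    t_network = transposes * local_bytes / net_bw

    return {"compute": t_compute, "memory": t_memory, "network": t_network}


def bottleneck(estimate):
    """Return the name of the phase with the largest estimated time."""
    return max(estimate, key=estimate.get)


if __name__ == "__main__":
    # Made-up machine parameters, chosen only to exercise the functions.
    est = fft3d_time_estimate(n=1024, p=1024, flops_per_node=1e12,
                              mem_bw=1e11, net_bw=1e10)
    for phase, seconds in est.items():
        print(f"{phase}: {seconds:.3e} s")
    print("bottleneck:", bottleneck(est))
```

With these made-up parameters the network term dominates, but the memory term is the next largest and grows in relative importance as network bandwidth per node improves. That is the kind of trade-off the abstract describes in qualitative terms.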