Memory requirements for balanced computer architectures
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Estimating interlock and improving balance for pipelined architectures
Journal of Parallel and Distributed Computing
The hidden cost of low bandwidth communication
Developing a computer science agenda for high-performance computing
The Future Fast Fourier Transform?
SIAM Journal on Scientific Computing
A Parallel 3-D FFT Algorithm on Clusters of Vector SMPs
PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
International Journal of High Performance Computing Applications
The development and integration of a distributed 3D FFT for a cluster of workstations
ALS'00 Proceedings of the 4th annual Linux Showcase & Conference - Volume 4
3D-Stacked Memory Architectures for Multi-core Processors
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Optimization of All-to-All Communication on the Blue Gene/L Supercomputer
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Many-Core vs. Many-Thread Machines: Stay Away From the Valley
IEEE Computer Architecture Letters
A 32x32x32, spatially distributed 3D FFT in four microseconds on Anton
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
IBM Journal of Research and Development
Communication analysis of parallel 3D FFT for flat cartesian meshes on large Blue Gene systems
HiPC'08 Proceedings of the 15th international conference on High performance computing
An empirically tuned 2D and 3D FFT library on CUDA GPU
Proceedings of the 24th ACM International Conference on Supercomputing
Web search using mobile cores: quantifying and mitigating the price of efficiency
Proceedings of the 37th annual international symposium on Computer architecture
Understanding throughput-oriented architectures
Communications of the ACM
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Optimizing bandwidth limited problems using one-sided communication and overlap
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
The International Exascale Software Project roadmap
International Journal of High Performance Computing Applications
FAWN: a fast array of wimpy nodes
Communications of the ACM
MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
Computer Science - Research and Development
Balance principles for algorithm-architecture co-design
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community
Computing in Science and Engineering
Using the TOP500 to trace and project technology and architecture trends
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Titanium performance and potential: an NPB experimental study
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Aspen: a domain specific language for performance modeling
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
This paper revisits the communication complexity of large-scale 3D fast Fourier transforms (FFTs) and asks what impact trends in current architectures will have on FFT performance at exascale. We analyze both memory hierarchy traffic and network communication to derive suitable analytical models, which we calibrate against current software implementations; we then evaluate models to make predictions about potential scaling outcomes at exascale, based on extrapolating current technology trends. Of particular interest is the performance impact of choosing high-density processors, typified today by graphics co-processors (GPUs), as the base processor for an exascale system. Among various observations, a key prediction is that although inter-node all-to-all communication is expected to be the bottleneck of distributed FFTs, intra-node communication---expressed precisely in terms of the relative balance among compute capacity, memory bandwidth, and network bandwidth---will play a critical role.