The NAS Parallel Benchmarks (NPB) are a well-known suite of benchmarks that serve as proxies for scientific computing applications. They specify several problem sizes representing how such applications might run on HPC systems of different scales. However, even the largest published problem size (class F) is far too small to properly exercise a petascale supercomputer. Our work shows how to scale the Block Tridiagonal (BT) benchmark from today's published sizes to petascale and exascale computing systems. In this paper we discuss the pros and cons of various ways of scaling, examine how scaling BT would affect computation, memory access, and communication, and identify the expected bottleneck, which turns out to be neither memory bandwidth nor communication bandwidth but network latency. We present two complementary ways to overcome this latency obstacle. We also describe a practical method for gathering approximate performance data for BT at exascale on actual hardware, without requiring an exascale system.
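The latency-bottleneck claim can be illustrated with a back-of-envelope strong-scaling model. The sketch below is an assumption-laden illustration, not an analysis from the paper: the grid size, latency, and bandwidth constants are hypothetical, and the simple latency-plus-bandwidth message cost model is our own simplification.

```python
# Rough strong-scaling model for a BT-like solver on a 2-D process grid.
# All constants are illustrative assumptions, not figures from the paper.
import math

N = 2560          # assumed grid points per dimension (order of class F)
LATENCY = 2e-6    # assumed per-message network latency, seconds
BANDWIDTH = 10e9  # assumed per-link network bandwidth, bytes/second
WORD = 8          # double-precision word, bytes

def face_message_time(P):
    """Return (latency, transfer time) for exchanging one subdomain face
    when N^2 points are split across a sqrt(P) x sqrt(P) process grid."""
    n = N / math.sqrt(P)            # points per side of one rank's subdomain
    msg_bytes = 5 * n * n * WORD    # 5 solution variables per face point
    return LATENCY, msg_bytes / BANDWIDTH

for P in (1024, 65536, 1048576):
    lat, xfer = face_message_time(P)
    frac = lat / (lat + xfer)
    print(f"P={P:>8}: latency fraction of message time {frac:5.1%}")
```

Under these assumptions the fixed per-message latency goes from a small overhead at a thousand ranks to the dominant cost at a million ranks, because the per-face message shrinks with the subdomain while the latency term does not.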