This paper presents a new strategy for improving the scalability of the barotropic mode in the Parallel Ocean Program (POP), based on a theoretical analysis of the barotropic communication bottleneck. POP discretizes the elliptic equations of the barotropic mode into a linear system Ax = b and solves it with the Preconditioned Conjugate Gradient (PCG) method. PCG scales poorly on distributed systems because the inner products in each iteration require time-consuming global reductions. A performance model is developed to quantify this scaling bottleneck. Based on the model, the classical Stiefel iteration (CSI), previously regarded as less efficient than PCG, is identified as promising for massively parallel runs. In contrast to PCG, the recurrence parameters of CSI are determined by the spectrum of the coefficient matrix A rather than by inner products of the residuals from previous iterations. Because the eigenvalues of the large matrix A are difficult to estimate directly, the Lanczos method is used to construct a small tridiagonal matrix whose eigenvalues approximate those of A. Replacing PCG with CSI eliminates the global reductions, and their inherently poor scalability, from the barotropic mode. In POP at 0.1 degree resolution, the CSI implementation accelerates one barotropic step by nearly a factor of five, from 1.23 s to 0.26 s, on 15,000 cores.
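The point that the recurrence parameters depend only on the spectrum of A can be illustrated with a minimal sketch of a textbook Chebyshev-type iteration. This is not the POP implementation. It assumes a toy matrix (the 1D Laplacian tridiag(-1, 2, -1)), no preconditioner, and exact analytic eigenvalue bounds in place of Lanczos estimates; the names `matvec` and `chebyshev_solve` are hypothetical. The loop contains no inner products, so a distributed version would need only the neighbor communication of the matrix-vector product and no global reduction.

```python
import math


def matvec(x):
    """Apply the toy matrix tridiag(-1, 2, -1) to x (a stand-in for A)."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


def chebyshev_solve(matvec, b, lam_min, lam_max, iterations):
    """Chebyshev-type iteration for an SPD system A x = b.

    alpha and beta are computed from the eigenvalue bounds alone,
    so no dot products (global reductions) appear in the loop.
    """
    d = (lam_max + lam_min) / 2.0  # center of the spectrum interval
    c = (lam_max - lam_min) / 2.0  # half-width of the spectrum interval
    x = [0.0] * len(b)
    r = list(b)                    # residual b - A x for the zero initial guess
    p = [0.0] * len(b)
    alpha = 0.0
    for k in range(1, iterations + 1):
        if k == 1:
            p = list(r)
            alpha = 1.0 / d
        else:
            # The second step uses a different beta than later steps.
            beta = 0.5 * (c * alpha) ** 2 if k == 2 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = [rj + beta * pj for rj, pj in zip(r, p)]
        x = [xj + alpha * pj for xj, pj in zip(x, p)]
        ax = matvec(x)
        r = [bj - aj for bj, aj in zip(b, ax)]
    return x


# Toy problem: the eigenvalues of tridiag(-1, 2, -1) are known in closed form,
# 2 - 2*cos(k*pi/(n+1)) for k = 1..n, so exact bounds are available here.
n = 10
lam_min = 2.0 - 2.0 * math.cos(math.pi / (n + 1))
lam_max = 2.0 - 2.0 * math.cos(n * math.pi / (n + 1))
x_true = [float(i + 1) for i in range(n)]
b = matvec(x_true)
x = chebyshev_solve(matvec, b, lam_min, lam_max, 200)
max_err = max(abs(xj - tj) for xj, tj in zip(x, x_true))
print("max error:", max_err)
```

In a realistic setting the bounds `lam_min` and `lam_max` are not known exactly, which is where an eigenvalue estimate such as the Lanczos-based one described above would come in. Because the iteration never computes a residual norm, a convergence check would have to be added separately, for example only every few iterations so that its reduction cost is amortized.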