A peta-scalable CPU-GPU algorithm for global atmospheric simulations

Authors:
Chao Yang;Wei Xue;Haohuan Fu;Lin Gan;Linfeng Li;Yangtong Xu;Yutong Lu;Jiachang Sun;Guangwen Yang;Weimin Zheng
Affiliations:
Institute of Software, Chinese Academy of Sciences, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;National University of Defense Technology, Changsha, China;Institute of Software, Chinese Academy of Sciences, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China
Venue:
Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2013

Abstract

Developing highly scalable algorithms for global atmospheric modeling is becoming increasingly important as scientists seek to understand the behavior of the global atmosphere at extreme scales. Heterogeneous architectures that combine processors and accelerators are now an important solution for large-scale computing. However, large-scale simulation of the global atmosphere poses a severe challenge to the development of highly scalable algorithms that fit well into state-of-the-art heterogeneous systems. Although GPU-accelerated computing has succeeded in some top-level applications, studies that fully exploit heterogeneous architectures in global atmospheric modeling remain rare, due in large part to both the computational difficulty of the mathematical models and the high accuracy required for long-term simulations. In this paper, we propose a peta-scalable hybrid algorithm and successfully apply it to a cubed-sphere shallow-water model for global atmospheric simulation. We employ an adjustable partition between CPUs and GPUs to achieve balanced utilization of the entire hybrid system, and present a pipe-flow scheme that conducts conflict-free inter-node communication on the cubed-sphere geometry and maximizes communication-computation overlap. Systematic optimizations for multithreading on both the GPU and CPU sides enhance computing throughput and improve memory efficiency. Our experiments demonstrate nearly ideal strong and weak scalability on up to 3,750 nodes of the Tianhe-1A. The largest run sustains 0.8 Pflops in double precision (32% of peak performance), using 45,000 CPU cores and 3,750 GPUs.
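
The idea of an adjustable CPU-GPU partition can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `split_rows` and `step_time`, the row-wise split, and the throughput figures are all assumptions made for illustration. The sketch divides the rows of one subdomain in proportion to each device's throughput, so that both sides finish a time step at roughly the same moment.

```python
def split_rows(n_rows, cpu_rate, gpu_rate):
    """Split n_rows between CPU and GPU in proportion to their throughput.

    cpu_rate and gpu_rate are hypothetical measured throughputs
    (rows per unit time); they are not values from the paper.
    """
    if n_rows < 0 or cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("n_rows must be >= 0 and rates must be positive")
    cpu_rows = round(n_rows * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_rows, n_rows - cpu_rows


def step_time(cpu_rows, gpu_rows, cpu_rate, gpu_rate):
    """Time for one step when both devices work concurrently.

    The step ends only when the slower side has finished.
    """
    return max(cpu_rows / cpu_rate, gpu_rows / gpu_rate)


if __name__ == "__main__":
    # Assumed example: the GPU is three times faster than the CPU cores.
    cpu_rows, gpu_rows = split_rows(1000, cpu_rate=1.0, gpu_rate=3.0)
    balanced = step_time(cpu_rows, gpu_rows, 1.0, 3.0)
    gpu_only = step_time(0, 1000, 1.0, 3.0)
    print(cpu_rows, gpu_rows, balanced, gpu_only)
```

Under these assumed rates, giving the CPU a quarter of the rows shortens the step compared with running everything on the GPU, which is the motivation for making the partition adjustable rather than fixed. In practice the ratio would be tuned from measured timings.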