The microphysical processes that lead to cloud and precipitation formation are among the most important physical processes in numerical weather prediction (NWP) and climate models. The Weather Research and Forecasting (WRF) Single Moment 6-class (WSM6) microphysics scheme in the Global/Regional Assimilation and Prediction System (GRAPES) carries prognostic variables for water vapor, cloud water, cloud ice, rain, snow, and graupel. WSM6 is the most time-consuming component of the entire GRAPES model. In recent years, with the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs), which offer low-power, low-cost, high-performance computing capacity, have been exploited to carry out the arithmetic operations in scientific and engineering simulations. In this paper, we present a GPU implementation of the WSM6 scheme in GRAPES to accelerate its computation. After a brief introduction to the WSM6 scheme, the data dependences governing its GPU implementation are analyzed. A data-parallel method is employed to exploit the massive fine-grained parallelism, and the CUDA programming model is used to convert the original WSM6 module into GPU programs. To achieve high computational performance, we propose mapping the horizontal domain onto an optimal thread-block size. Experimental results demonstrate that the GPU version achieves over a 140x speedup compared with the serial CPU version, making it an efficient parallel implementation.
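The decomposition described above can be sketched as a CUDA kernel. This is a minimal illustration, not the authors' code: it assumes the common layout in which each thread owns one horizontal (i, j) grid column and loops serially over the k vertical levels, and all names (`nx`, `ny`, `nz`, `qv`, `qc`, `qr`) and the 32x4 block size are illustrative placeholders, not values from the paper.

```cuda
#include <cuda_runtime.h>

// One thread per horizontal (i, j) column; the vertical k loop stays serial
// inside the thread. Field arrays are 3-D, linearized as [k][j][i].
__global__ void wsm6_columns(float *qv, float *qc, float *qr,
                             int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // horizontal x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // horizontal y index
    if (i >= nx || j >= ny) return;                  // guard partial blocks

    for (int k = 0; k < nz; ++k) {
        int idx = (k * ny + j) * nx + i;
        // The real WSM6 tendency calculations for this grid cell would go
        // here; the update below is only a placeholder.
        qv[idx] = fmaxf(qv[idx] - 0.1f * qc[idx], 0.0f);
        (void)qr;
    }
}

// Host-side launch: the block shape is the tunable the abstract refers to
// when it mentions choosing an optimal block size for the horizontal domain.
void launch_wsm6(float *qv, float *qc, float *qr, int nx, int ny, int nz)
{
    dim3 block(32, 4);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y);
    wsm6_columns<<<grid, block>>>(qv, qc, qr, nx, ny, nz);
}
```

Because microphysics columns do not exchange data horizontally, each thread can work independently, which is the fine-grained parallelism the abstract exploits; the grid dimensions simply tile the whole horizontal domain.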