High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster

  • Authors:
  • Dimitri Komatitsch;Gordon Erlebacher;Dominik Göddeke;David Michéa

  • Affiliations:
  • Université de Pau et des Pays de l'Adour, CNRS & INRIA Magique-3D, Laboratoire de Modélisation et d'Imagerie en Géosciences UMR 5212, Avenue de l'Université, 64013 Pau Ce ...;Department of Scientific Computing, Florida State University, Tallahassee 32306, USA;Institut für Angewandte Mathematik, TU Dortmund, Germany;Université de Pau et des Pays de l'Adour, CNRS & INRIA Magique-3D, Laboratoire de Modélisation et d'Imagerie en Géosciences UMR 5212, Avenue de l'Université, 64013 Pau Ce ...

  • Venue:
  • Journal of Computational Physics
  • Year:
  • 2010

Quantified Score

Hi-index 31.47

Abstract

We implement a high-order finite-element application that simulates seismic wave propagation, resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours runs successfully in single precision, which maximizes the performance of current-generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing, highly optimized implementation in C and MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages to overlap the communications across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements and, depending on how the problem is mapped to the reference CPU cluster, obtain a speedup of 20x or 12x.
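
The mesh-coloring idea mentioned in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than CUDA, it is not the authors' implementation, and every name in it (`color_elements`, `assemble_by_color`, the toy mesh) is hypothetical. It assumes a simple greedy coloring in which two elements that share a global node never receive the same color, so that the summation into shared degrees of freedom can proceed one color at a time without conflicting writes.

```python
from collections import defaultdict

def color_elements(elements):
    """Greedy coloring: elements that share a global node get different colors."""
    # Map each global node to the list of elements that touch it.
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(elements)  # -1 means "not yet colored"
    for e, nodes in enumerate(elements):
        # Colors already taken by neighbors that share at least one node.
        used = {colors[other]
                for n in nodes
                for other in node_to_elems[n]
                if colors[other] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors

def assemble_by_color(elements, colors, local_values, num_nodes):
    """Sum local element contributions into a global array, color by color."""
    global_values = [0.0] * num_nodes
    for c in range(max(colors) + 1):
        # Within one color no two elements share a node, so on a GPU these
        # updates could be issued in parallel without atomic operations.
        for e, nodes in enumerate(elements):
            if colors[e] == c:
                for local_idx, n in enumerate(nodes):
                    global_values[n] += local_values[e][local_idx]
    return global_values

# Toy mesh (an assumption for illustration): a chain of three 2-node elements.
elements = [(0, 1), (1, 2), (2, 3)]
colors = color_elements(elements)
local_values = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
result = assemble_by_color(elements, colors, local_values, num_nodes=4)
```

In this toy case the two end elements share no node and can take the same color, while the middle element needs a second one. The interior nodes then receive contributions from two elements each, accumulated in separate color passes.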