We implement a high-order finite-element application that simulates seismic wave propagation, resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Unlike many finite-element implementations, ours runs successfully in single precision, which maximizes the performance of current-generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing, highly optimized implementation in C and MPI on a classical cluster of CPU nodes. We use mesh coloring to handle summation operations over degrees of freedom on an unstructured mesh efficiently, and non-blocking MPI messages to overlap communication across the network and data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements; depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x.
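The mesh-coloring idea mentioned above can be illustrated with a minimal sketch. It is not the paper's CUDA implementation: the function names `color_elements` and `assemble_by_color`, the greedy coloring strategy, and the tiny 2x2 quadrilateral mesh are all assumptions chosen for illustration. The point it shows is that elements sharing a global node receive different colors, so all elements of one color can add their contributions to the global array concurrently without write conflicts.

```python
from collections import defaultdict


def color_elements(elements):
    """Greedy coloring (an illustrative choice): elements that share
    a global node never receive the same color."""
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(elements)
    for e, nodes in enumerate(elements):
        # Colors already taken by neighbors that share at least one node.
        used = {
            colors[other]
            for n in nodes
            for other in node_to_elems[n]
            if colors[other] >= 0
        }
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


def assemble_by_color(elements, colors, local_values, num_nodes):
    """Sum element contributions into global nodes, one color at a time."""
    global_sum = [0.0] * num_nodes
    for c in range(max(colors, default=-1) + 1):
        # Within one color no two elements touch the same node, so on a
        # GPU this inner loop could run in parallel without atomics.
        for e, nodes in enumerate(elements):
            if colors[e] != c:
                continue
            for local, n in enumerate(nodes):
                global_sum[n] += local_values[e][local]
    return global_sum


# Hypothetical mesh: four quadrilaterals in a 2x2 grid, nodes numbered 0..8.
elements = [(0, 1, 4, 3), (1, 2, 5, 4), (3, 4, 7, 6), (4, 5, 8, 7)]
colors = color_elements(elements)
local_values = [[1.0] * 4 for _ in elements]
global_sum = assemble_by_color(elements, colors, local_values, num_nodes=9)
```

In this toy mesh every element touches the central node, so each element gets its own color; on a realistic unstructured mesh a handful of colors typically covers many elements each, which is what makes the per-color parallel loop worthwhile.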