In this paper we describe a GPU parallelization of 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves a throughput of between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with the number of GPUs by overlapping inter-GPU communication with computation.
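To make the redundancy metric concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that read redundancy means the number of input elements loaded per output element, and that each thread block loads a 2D tile plus a halo of `radius` cells on each of its four sides, without corners. The names `read_redundancy` and `stencil_point` are hypothetical, and the CPU reference stencil only illustrates the access pattern a GPU kernel would have to reproduce.

```python
def read_redundancy(tile_x, tile_y, radius):
    """Input elements read per output element for one 2D tile.

    Assumption (not from the abstract): the tile is loaded together with a
    halo of `radius` cells on its left, right, top, and bottom edges, with
    no corner cells, and every tile element yields exactly one output.
    """
    outputs = tile_x * tile_y
    halo = 2 * radius * (tile_x + tile_y)
    return (outputs + halo) / outputs


def stencil_point(grid, x, y, z, coeffs):
    """Symmetric 3D stencil at one point of a nested-list grid[z][y][x].

    coeffs[0] weights the centre; coeffs[i] weights the six neighbours at
    distance i along the x, y, and z axes. The caller must keep the point
    at least len(coeffs) - 1 cells away from every boundary.
    """
    total = coeffs[0] * grid[z][y][x]
    for i in range(1, len(coeffs)):
        total += coeffs[i] * (
            grid[z][y][x - i] + grid[z][y][x + i]
            + grid[z][y - i][x] + grid[z][y + i][x]
            + grid[z - i][y][x] + grid[z + i][y][x]
        )
    return total
```

Under these assumptions, `read_redundancy(16, 16, 4)` evaluates to 2.0 and `read_redundancy(32, 32, 4)` to 1.5, which shows why larger tiles reduce redundant reads for a fixed stencil radius.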