Emerging many-core processors, such as CUDA-capable NVIDIA GPUs, are promising platforms for regular parallel algorithms like the lattice Boltzmann method (LBM). Since the global memory of graphics devices has high latency and LBM is data-intensive, the memory access pattern is a key issue for achieving good performance. Whenever possible, global memory loads and stores should be coalesced and aligned, but the propagation phase of LBM can lead to frequent misaligned memory accesses. Most previous CUDA implementations of 3D LBM addressed this problem by using the low-latency on-chip shared memory. Instead, our CUDA implementation of LBM relies on carefully chosen data transfer schemes in global memory. For the 3D lid-driven cavity test case, we obtained up to 86% of the maximal global memory throughput on NVIDIA's GT200. We show that, as a consequence, highly efficient implementations of LBM on GPUs are possible, even for complex models.