Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems

Authors:
S. Donath; K. Iglberger; G. Wellein; T. Zeiser; A. Nitsure; U. Rüde
Affiliations:
System Simulation – Computer Science 10 (LSS), University of Erlangen-Nuremberg, Germany.;System Simulation – Computer Science 10 (LSS), University of Erlangen-Nuremberg, Germany.;Regional Computing Center of Erlangen (RRZE), University of Erlangen-Nuremberg, Germany.;Regional Computing Center of Erlangen (RRZE), University of Erlangen-Nuremberg, Germany.;Regional Computing Center of Erlangen (RRZE), University of Erlangen-Nuremberg, Germany.;System Simulation – Computer Science 10 (LSS), University of Erlangen-Nuremberg, Germany
Venue:
International Journal of Computational Science and Engineering
Year:
2008

References (3):
Tiling optimizations for 3D scientific computations. Proceedings of the 2000 ACM/IEEE conference on Supercomputing.
Cache-Oblivious Algorithms. FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
Cache oblivious stencil computations. Proceedings of the 19th annual international conference on Supercomputing.

Cited by (3):
Coupling multibody dynamics and computational fluid dynamics on 8192 processor cores. Parallel Computing.
Direct Numerical Simulation of Particulate Flows on 294912 Processor Cores. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
Multi-thread implementations of the lattice Boltzmann method on non-uniform grids for CPUs and GPUs. Computers & Mathematics with Applications.

Abstract

In this report, we discuss the performance behaviour of different parallel lattice Boltzmann implementations. In previous work, we proposed a fast serial implementation and a cache oblivious spatial and temporal blocking algorithm for the lattice Boltzmann method (LBM) in three spatial dimensions. The cache oblivious update scheme was originally proposed by Frigo et al. Its main idea is to maximise the performance of stencil-based methods by dividing the space-time domain in an optimal way, independently of external parameters such as cache size. In view of the increasing gap between processor speed and memory performance, this approach offers a promising path to better cache utilisation. We present results for a shared memory parallelisation of the cache oblivious implementation based on task queueing and compare it with the standard iterative implementation, focusing on issues specific to multi-core and multi-socket systems.
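
The space-time decomposition mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' 3D LBM code: it assumes a one-dimensional three-point averaging stencil with fixed boundary values, and it follows the recursive trapezoid cuts published in the cited "Cache oblivious stencil computations" work. The names `naive`, `cache_oblivious`, `kernel` and `walk` are hypothetical and chosen only for this illustration. Note that no cache size or block size appears anywhere in the recursion.

```python
def naive(u0, steps):
    """Standard iterative scheme: sweep the whole grid once per time step."""
    cur = list(u0)
    for _ in range(steps):
        nxt = list(cur)  # boundary points 0 and n-1 are carried over unchanged
        for x in range(1, len(cur) - 1):
            nxt[x] = 0.25 * cur[x - 1] + 0.5 * cur[x] + 0.25 * cur[x + 1]
        cur = nxt
    return cur


def cache_oblivious(u0, steps):
    """Same update, but traversed by recursive space-time trapezoid cuts."""
    n = len(u0)
    buf = [list(u0), list(u0)]  # two time levels, selected by t % 2
    slope = 1                   # a three-point stencil reaches one cell per step

    def kernel(t, x):
        # Compute point x at time t + 1 from its neighbours at time t.
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        dst[x] = 0.25 * src[x - 1] + 0.5 * src[x] + 0.25 * src[x + 1]

    def walk(t0, t1, x0, dx0, x1, dx1):
        # Trapezoid: times t0..t1, base [x0, x1) at t0, edge slopes dx0 and dx1.
        dt = t1 - t0
        if dt == 1:
            for x in range(x0, x1):
                kernel(t0, x)
        elif dt > 1:
            if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * slope * dt:
                # Wide trapezoid: space cut along a line of slope -slope.
                xm = (2 * (x0 + x1) + (2 * slope + dx0 + dx1) * dt) // 4
                walk(t0, t1, x0, dx0, xm, -slope)
                walk(t0, t1, xm, -slope, x1, dx1)
            else:
                # Tall trapezoid: time cut, lower half first.
                s = dt // 2
                walk(t0, t0 + s, x0, dx0, x1, dx1)
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)

    walk(0, steps, 1, 0, n - 1, 0)  # interior points are 1 .. n-2
    return buf[steps % 2]
```

Both functions perform the same per-point arithmetic, so they should agree on any input; only the traversal order differs. In a shared memory setting, the sub-trapezoids produced by the cuts are the natural units to hand to a task queue, subject to their data dependencies, which this serial sketch does not attempt to model.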