Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms

  • Authors:
  • Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick

  • Affiliations:
  • Samuel Williams: CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States and CS Division, University of California at Berkeley, Berkeley, CA 94720, United States
  • Jonathan Carter: CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Leonid Oliker: CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • John Shalf: CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Katherine Yelick: CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States and CS Division, University of California at Berkeley, Berkeley, CA 94720, United States

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2009

Abstract

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the high-performance computing (HPC) literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well as a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15 times improvement compared with the original code at a given concurrency. Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.
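The search-based strategy the abstract describes, generating many candidate code variants and empirically timing each one to select the fastest for a given platform, can be illustrated with a minimal sketch. The kernel variants, function names, and timing harness below are hypothetical stand-ins, not LBMHD's actual code generator; a real tuner would compile and benchmark generated C kernels over parameters such as unrolling depth, vectorization, and blocking.

```python
import time

def kernel_naive(data):
    # Baseline variant: straightforward accumulation over the lattice values.
    total = 0.0
    for x in data:
        total += x
    return total

def kernel_unrolled4(data):
    # Candidate variant: 4-way loop unrolling, one point in a tuning search space.
    total = 0.0
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
    for i in range(n, len(data)):
        total += data[i]
    return total

def _time_once(fn, data):
    # Wall-clock time for a single invocation of one variant.
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start

def autotune(variants, data, trials=3):
    # Empirical search: time every variant (best of several trials to
    # reduce noise) and keep the fastest observed one for this machine.
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        elapsed = min(_time_once(fn, data) for _ in range(trials))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

variants = {"naive": kernel_naive, "unrolled4": kernel_unrolled4}
data = [0.5] * 200_000
winner = autotune(variants, data)
print("fastest variant on this platform:", winner)
```

Because the winner is chosen by measurement rather than by a performance model, the same search rerun on a different processor may legitimately pick a different variant, which is exactly why the paper tunes per platform instead of hand-optimizing once.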