What can performance counters do for memory subsystem analysis?
Proceedings of the 2008 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
With the current shift toward increasing a processor's computational power by adding cores rather than raising the clock frequency, computational efficiency is gaining importance for computational fluid dynamics codes. This is especially critical for applications that require high throughput. For example, applying computational fluid dynamics simulations to multi-disciplinary design optimization requires a large number of similar simulations with different input parameters, so a reduction in the runtime of the code can lead to a large reduction in the duration of the design process.

In our case study, a two-dimensional, block-structured computational fluid dynamics code was optimized for performance on machines with hierarchical memory systems. This paper illustrates the techniques applied to transform the initial version of the code into an optimized version, yielding performance improvements ranging from 10% for very small cases to about 50% for large test cases that did not fit into the cache memory of the target processor. A detailed performance analysis of the code, from the global level down to individual subroutines and data structures, is presented. The performance improvements can be explained by a reduction of cache misses at all levels of the memory hierarchy: in the optimized version of the code, L1 cache misses were reduced by about 50%, L2 cache misses by about 80%, and translation lookaside buffer (TLB) misses by about 90%.

The code's performance was also evaluated on multi-core processors, where efficiency is especially important when several instances of an application run simultaneously. In this case, the most optimized variant, a blocked version of the optimized code, maintained its efficiency more effectively as additional cores were activated than the unblocked version did. This illustrates that optimizing cache performance may become increasingly important as the number of cores per processor continues to rise.
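The "blocked" variant mentioned above refers to the general technique of loop blocking (tiling). As a minimal sketch of the idea, not the paper's actual CFD code, the following C fragment contrasts a naive and a blocked 2-D transpose; the kernel, matrix size, and tile size are illustrative assumptions.

```c
/* Sketch of loop blocking (tiling). The transpose kernel and the tile
 * size are hypothetical examples, not the code studied in the paper. */

/* Naive transpose: the stores to dst walk down a column, so each store
 * touches a different cache line and, for large n, a different page
 * (hence a different TLB entry). */
void transpose_naive(const double *src, double *dst, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked transpose: the same work proceeds tile by tile, so the source
 * and destination tiles both stay cache-resident while in use, reducing
 * cache and TLB misses for matrices larger than the cache. */
void transpose_blocked(const double *src, double *dst, int n, int b) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int i = ii; i < ii + b && i < n; i++)
                for (int j = jj; j < jj + b && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both variants compute identical results; only the traversal order, and therefore the memory-access locality, differs. In practice the tile size `b` would be tuned so that two b-by-b tiles of doubles fit comfortably in the L1 cache.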