Optimization of a Computational Fluid Dynamics Code for the Memory Hierarchy: A Case Study

  • Authors:
  • Thomas Hauser; Raymond Lebeau

  • Affiliations:
  • Academic & Research Technologies, Northwestern University, 1970 Campus Drive, Evanston, IL 60208, USA; Physics and Astronomy Department, University of Kentucky, Lexington, KY, USA

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2010

Abstract

With the current shift toward increasing the computational power of processors by adding cores rather than raising the clock frequency, computational efficiency is gaining importance for computational fluid dynamics codes. This is especially critical for applications that require high throughput. For example, applying computational fluid dynamics simulations to multi-disciplinary design optimization requires a large number of similar simulations with different input parameters; a reduction in the runtime of the code can therefore lead to a large reduction in design time. In our case study, a two-dimensional, block-structured computational fluid dynamics code was optimized for performance on machines with hierarchical memory systems. This paper illustrates the techniques applied to transform an initial version of the code into an optimized version that yielded performance improvements ranging from 10% for very small cases to about 50% for large test cases that did not fit into the cache memory of the target processor. A detailed performance analysis of the code, from the global level down to subroutines and data structures, is presented. The performance improvements can be explained by a reduction of cache misses at all levels of the memory hierarchy: for the optimized version of the code, L1 cache misses were reduced by about 50%, L2 cache misses by about 80%, and translation lookaside buffer misses by about 90%. The code performance was also evaluated on multi-core processors, where efficiency is especially important when several instances of an application run simultaneously. In this case, the most optimized version, a blocked version of the optimized code, maintained efficiency more effectively as more cores were activated than the unblocked version did. This illustrates that optimizing cache performance may become increasingly important as the number of cores per processor continues to rise.