What can performance counters do for memory subsystem analysis?
Proceedings of the 2008 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
With the current shift toward increasing a processor's computational power by adding cores rather than raising the clock frequency, computational efficiency is gaining importance for computational fluid dynamics codes. This is especially critical for applications that require high throughput. For example, applying computational fluid dynamics simulations to multi-disciplinary design optimization requires a large number of similar simulations with different input parameters, so a reduction in the runtime of the code can lead to a large reduction in the duration of the design process.

In our case study, a two-dimensional, block-structured computational fluid dynamics code was optimized for performance on machines with hierarchical memory systems. This paper illustrates the techniques applied to transform the initial version of the code into an optimized version, yielding performance improvements ranging from 10% for very small cases to about 50% for large test cases that did not fit into the cache memory of the target processor. A detailed performance analysis of the code, from the global level down to individual subroutines and data structures, is presented. The performance improvements can be explained by a reduction of cache misses at all levels of the memory hierarchy: in the optimized version of the code, L1 cache misses were reduced by about 50%, L2 cache misses by about 80%, and translation lookaside buffer (TLB) misses by about 90%.

The code's performance was also evaluated on multi-core processors, where efficiency is especially important when several instances of an application run simultaneously. In this case, the most optimized variant, a blocked version of the optimized code, maintained its efficiency more effectively as additional cores were activated than the unblocked version did. This illustrates that optimizing cache performance may become increasingly important as the number of cores per processor continues to rise.
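The "blocked" variant mentioned above refers to the general technique of loop blocking (tiling). As a minimal sketch of the idea, not the paper's actual CFD code, the following C fragment contrasts a naive and a blocked 2-D transpose; the kernel, matrix size, and tile size are illustrative assumptions.

```c
/* Sketch of loop blocking (tiling). The transpose kernel and the tile
 * size are hypothetical examples, not the code studied in the paper. */

/* Naive transpose: the stores to dst walk down a column, so each store
 * touches a different cache line and, for large n, a different page
 * (hence a different TLB entry). */
void transpose_naive(const double *src, double *dst, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked transpose: the same work proceeds tile by tile, so the source
 * and destination tiles both stay cache-resident while in use, reducing
 * cache and TLB misses for matrices larger than the cache. */
void transpose_blocked(const double *src, double *dst, int n, int b) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int i = ii; i < ii + b && i < n; i++)
                for (int j = jj; j < jj + b && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both variants compute identical results; only the traversal order, and therefore the memory-access locality, differs. In practice the tile size `b` would be tuned so that two b-by-b tiles of doubles fit comfortably in the L1 cache.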