Model-guided empirical optimization for memory hierarchy

  • Authors:
  • Mary Hall;Chun Chen

  • Affiliations:
  • University of Southern California;University of Southern California

  • Venue:
  • Model-guided empirical optimization for memory hierarchy
  • Year:
  • 2007

Abstract

The performance gap between processor and memory speed continues to widen on today's architectures. To bridge this gap, high-performance computers commonly provide architectural features such as SIMD and scalar registers and multiple levels of cache. Exploiting these features and managing their complex interactions poses a serious challenge for software that aims to achieve the best performance on these architectures. As a result, it has become increasingly difficult for compilers to statically select the best optimizations among a large number of code transformations and parameter choices, and compiler-optimized codes often perform well below the best manually tuned codes. Moreover, today's compilers are ineffective in transforming complex loop nests. Existing approaches are either complicated to apply or difficult to integrate with other loop transformations, so compilers often cannot generate code of the same quality as manually tuned code for such loop nests.
We propose in this dissertation a new compiler approach for optimizing for the complete memory hierarchy. Our approach combines compiler analyses and models with guided empirical search to take advantage of their complementary strengths. The analyses and models limit the search to a small number of candidate optimized codes, and the empirical results give the compiler the most accurate information for selecting among the candidates and tuning optimization parameter values.
This research makes the following contributions. First, to support complex loop constructs, we develop a loop transformation framework that can automatically generate high-quality codes. Second, we combine this framework with the required analyses and optimization strategies targeting multiple levels of the memory hierarchy. To facilitate the empirical search, each code variant generated from compiler analyses is expressed as a script in which transformation parameters such as tile sizes can be adjusted; the transformed code is then generated from the script and run on the target machine to measure its performance. Finally, we have implemented the above compiler framework. Experimental results on the Pentium M and SGI R10000 show that our approach achieves performance comparable with the best manually tuned codes and significantly better (up to 11x speedup) than existing compiler approaches.
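The two-stage idea in the abstract, a model that prunes the search space followed by measurements that pick the winner, can be illustrated with a minimal sketch. This is not the dissertation's framework: the capacity rule (three tiles must fit in the cache), the function names `candidate_tiles`, `tiled_matmul` and `empirical_search`, and the choice of matrix multiply as the kernel are all assumptions made for illustration. Timings taken in interpreted Python will not reflect real cache behavior, so the sketch shows only the structure of the search.

```python
import time

def candidate_tiles(n, cache_bytes, elem_bytes=8):
    """Model step (assumed rule): keep power-of-two tile sizes that divide n
    and whose three t-by-t tiles fit in the cache together."""
    tiles = []
    t = 4
    while t <= n:
        if n % t == 0 and 3 * t * t * elem_bytes <= cache_bytes:
            tiles.append(t)
        t *= 2
    return tiles

def tiled_matmul(a, b, n, t):
    """One code variant: n-by-n matrix multiply tiled with tile size t."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                for i in range(ii, ii + t):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, kk + t):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, jj + t):
                            row_c[j] += aik * row_b[j]
    return c

def empirical_search(n, cache_bytes):
    """Empirical step: time each surviving candidate and return the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for t in candidate_tiles(n, cache_bytes):
        start = time.perf_counter()
        tiled_matmul(a, b, n, t)
        timings[t] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings
```

Calling `empirical_search(64, 32 * 1024)` times only the tile sizes that the capacity rule admits rather than every divisor of 64, which is the sense in which the model limits the search to a small number of candidates.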