Model-guided empirical optimization for memory hierarchy

Authors:
Mary Hall; Chun Chen
Affiliations:
University of Southern California; University of Southern California
Venue:
Model-guided empirical optimization for memory hierarchy
Year:
2007

Cited by 8:

Model-guided autotuning of high-productivity languages for petascale computing
Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing

Speeding up Nek5000 with autotuning and specialization
Proceedings of the 24th ACM International Conference on Supercomputing

A programming language interface to describe transformations and code generation
LCPC'10 Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing

Auto-tuning full applications: A case study
International Journal of High Performance Computing Applications

Loop transformation recipes for code generation and auto-tuning
LCPC'09 Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing

Auto-tuning for energy usage in scientific applications
Euro-Par'11 Proceedings of the 2011 International Conference on Parallel Processing - Volume 2

A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers

Towards making autotuning mainstream
International Journal of High Performance Computing Applications

Abstract

We face an increasing performance gap between processor and memory speeds on today's architectures. To bridge this gap, architectural features such as SIMD and scalar registers and multiple levels of cache are common on today's high-performance computers. Exploiting these features and managing their complex interactions pose a serious challenge for software that aims to achieve the best performance on these architectures. Consequently, it has become increasingly difficult for compilers to statically select the best optimizations from a large number of code transformations and parameter choices, and compiler-optimized code often performs well below the best manually tuned code. Moreover, today's compilers are ineffective at transforming complex loop nests: existing approaches are either complicated to apply or difficult to integrate with other loop transformations, so compilers often cannot generate code of the same quality as manually tuned code for such loop nests.

This dissertation proposes a new compiler approach to optimizing for the complete memory hierarchy. Our approach combines compiler analyses and models with guided empirical search to take advantage of their complementary strengths: the analyses and models limit the search to a small number of candidate optimized codes, and the empirical results give the compiler the most accurate information for selecting among candidates and tuning optimization parameter values.

This research makes the following contributions. First, to support complex loop constructs, we develop a loop transformation framework that can automatically generate high-quality code. Second, we combine this framework with the analyses and optimization strategies required to target multiple levels of the memory hierarchy.
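The division of labor between models and empirical search can be illustrated with a small sketch. It assumes a hypothetical capacity model (keep power-of-two tile sizes whose three T x T tiles of a tiled matrix multiply fit in an assumed 32 KiB cache, and retain only the largest few), then times each surviving candidate and keeps the fastest. The names `tiled_matmul`, `model_candidates`, and `empirical_search` are illustrative, not taken from the dissertation, and pure-Python timings are dominated by interpreter overhead, so the selected tile here does not reflect real cache behavior the way compiled code would.

```python
import time


def tiled_matmul(A, B, n, tile):
    """Return C = A * B for n x n lists of lists, with the i, k, j loops tiled by `tile`."""
    C = [[0.0] * n for _ in range(n)]
    # Tile-controlling loops: each (ii, kk, jj) step works on one tile of A, B and C.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Element loops stay inside the current tiles (clamped at the matrix edge).
                for i in range(ii, min(ii + tile, n)):
                    row_c, row_a = C[i], A[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik, row_b = row_a[k], B[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return C


def model_candidates(cache_bytes, elem_bytes=8, max_candidates=3):
    """Model step (hypothetical capacity model): keep power-of-two tile sizes whose
    three T x T tiles fit in `cache_bytes`, and return only the largest
    `max_candidates` so that the empirical search stays small."""
    fits = []
    t = 2
    while 3 * t * t * elem_bytes <= cache_bytes:
        fits.append(t)
        t *= 2
    return fits[-max_candidates:]


def time_once(A, B, n, tile):
    start = time.perf_counter()
    tiled_matmul(A, B, n, tile)
    return time.perf_counter() - start


def empirical_search(n, candidates, repeats=3):
    """Empirical step: run every candidate and keep the fastest (best of `repeats` runs)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        elapsed = min(time_once(A, B, n, tile) for _ in range(repeats))
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile, best_time


if __name__ == "__main__":
    candidates = model_candidates(cache_bytes=32 * 1024)  # assumed 32 KiB cache
    print("model-pruned candidates:", candidates)
    tile, seconds = empirical_search(n=64, candidates=candidates)
    print(f"selected tile size {tile} ({seconds:.4f} s)")
```

The model never picks the final tile size by itself; it only prunes the space so that a handful of measured runs are enough, which mirrors the complementary roles described above.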
To facilitate the empirical search, each code variant derived from the compiler analyses is expressed as a script in which transformation parameters, such as tile sizes, can be adjusted. The transformed code is then generated from the script and executed on the target machine to measure its performance. Finally, we have implemented this compiler framework. Experimental results on the Pentium M and SGI R10000 show that our approach can achieve performance comparable to the best manually tuned code and significantly better than existing compiler approaches (up to an 11x speedup).
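The script-based variant mechanism, in which a code variant is described by a script whose transformation parameters are left adjustable and the code is then generated and run, can be sketched as follows. The script format (`SCRIPT` as a list of `("permute", ...)` and `("tile", ...)` steps), the symbolic parameter name `TJ`, and the helpers `bind`, `generate`, `compile_variant`, and `run_variant` are all hypothetical and are not the dissertation's actual script language. The generator assumes a matrix-multiply loop nest over `i`, `j`, `k`, hoists tile-controlling loops outermost, and emits Python source that is compiled with `exec` and timed.

```python
import time

# Hypothetical variant script: a loop order plus a tile of loop j whose size is
# left symbolic ("TJ") so that the search can bind different values to it.
SCRIPT = [
    ("permute", ("i", "k", "j")),
    ("tile", ("j", "TJ")),
]


def bind(script, params):
    """Substitute concrete values for symbolic parameter names such as 'TJ'."""
    return [(op, tuple(params.get(arg, arg) for arg in args)) for op, args in script]


def generate(bound_script):
    """Emit Python source for C += A * B with the requested loop order and tiling."""
    order, tiles = ("i", "j", "k"), {}
    for op, args in bound_script:
        if op == "permute":
            order = args
        elif op == "tile":
            loop, size = args
            if not isinstance(size, int):
                raise ValueError(f"unbound tile size for loop {loop}: {size!r}")
            tiles[loop] = size
        else:
            raise ValueError(f"unknown transformation: {op}")
    lines, depth = ["def kernel(A, B, C, n):"], 1
    # Tile-controlling loops go outermost, in the permuted order.
    for loop in order:
        if loop in tiles:
            lines.append("    " * depth + f"for {loop}{loop} in range(0, n, {tiles[loop]}):")
            depth += 1
    # Element loops follow, bounded by the enclosing tile where one exists.
    for loop in order:
        if loop in tiles:
            bounds = f"{loop}{loop}, min({loop}{loop} + {tiles[loop]}, n)"
        else:
            bounds = "0, n"
        lines.append("    " * depth + f"for {loop} in range({bounds}):")
        depth += 1
    lines.append("    " * depth + "C[i][j] += A[i][k] * B[k][j]")
    return "\n".join(lines) + "\n"


def compile_variant(source):
    """Turn generated source into a callable kernel."""
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]


def run_variant(kernel, n):
    """Execute one variant on fresh inputs and return (elapsed seconds, C)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    C = [[0.0] * n for _ in range(n)]
    start = time.perf_counter()
    kernel(A, B, C, n)
    return time.perf_counter() - start, C


if __name__ == "__main__":
    for tj in (8, 16, 32):
        source = generate(bind(SCRIPT, {"TJ": tj}))
        elapsed, _ = run_variant(compile_variant(source), 64)
        print(f"TJ={tj}: {elapsed:.4f} s")
```

Keeping the tile size symbolic in the script means the transformation sequence is written once, and only the parameter binding changes from one empirical run to the next.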