Loop Optimization using Hierarchical Compilation and Kernel Decomposition

Authors:
Denis Barthou;Sebastien Donadio;Patrick Carribault;Alexandre Duchateau;William Jalby
Affiliations:
Universite de Versailles Saint-Quentin, France;Bull SA, Les Clayes sous Bois, France;Bull SA, Les Clayes sous Bois, France;LRC ITACA, CEA/DAM and Université de Versailles Saint-Quentin, France;Universite de Versailles Saint-Quentin, France
Venue:
Proceedings of the International Symposium on Code Generation and Optimization
Year:
2007


Abstract

The increasing complexity of hardware features in recent processors makes high-performance code generation very challenging. In particular, several optimization targets have to be pursued simultaneously (minimizing L1/L2/L3/TLB misses and maximizing instruction-level parallelism). Very often, these optimization goals impose different and contradictory constraints on the transformations to be applied. We propose a new hierarchical compilation approach for the generation of high-performance code that relies on state-of-the-art compilers. This approach is not application-dependent and does not require any assembly hand-coding. It relies on the decomposition of the original loop nest into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to optimize dense matrix multiply primitives (not only for the square case but also for the more general rectangular cases) and convolution. The optimized codes on Itanium 2 and Pentium 4 architectures outperform ATLAS and, in most cases, match hand-tuned vendor libraries (e.g., MKL).
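
The decomposition idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the names `kernel_2d`, `matmul_decomposed`, and `matmul_reference`, the `tile` parameter, and the particular choice of which loops go into the kernel are all assumptions made for illustration. Python stands in for the compiled kernels only to show the structure: an outer driver walks tiles of the iteration space, and the inner work is a simple 2D loop that would, in the paper's setting, be the unit handed to the compiler for optimization.

```python
def kernel_2d(a, b, c, i, j0, j1, k0, k1):
    """Hypothetical 2D kernel: update one row segment c[i][j0:j1]
    using the k-range [k0, k1). Only two loops (k, j) remain here."""
    row_c = c[i]
    row_a = a[i]
    for k in range(k0, k1):
        a_ik = row_a[k]
        row_b = b[k]
        for j in range(j0, j1):
            row_c[j] += a_ik * row_b[j]


def matmul_decomposed(a, b, tile=4):
    """Driver loop nest (assumed structure): tiles the j and k dimensions
    and calls the 2D kernel once per row and tile. Works for rectangular
    inputs: a is n x m, b is m x p."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for jj in range(0, p, tile):
        for kk in range(0, m, tile):
            for i in range(n):
                kernel_2d(a, b, c, i,
                          jj, min(jj + tile, p),
                          kk, min(kk + tile, m))
    return c


def matmul_reference(a, b):
    """Plain triple loop, used only to check the decomposed version."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c
```

In this sketch the tile size is a single free parameter. A real system would choose tile shapes per memory level and pick among several kernel variants, which this example does not attempt.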