Effective automatic parallelization and locality optimization using the polyhedral model

  • Authors:
  • P. Sadayappan; Uday Kumar Reddy Bondhugula

  • Affiliations:
  • The Ohio State University; The Ohio State University

  • Venue:
  • Ph.D. dissertation, The Ohio State University
  • Year:
  • 2008

Abstract

Multicore processors have now become mainstream. The difficulty of programming these architectures to effectively tap the potential of multiple processing units is well known. Among several ways of addressing this issue, one of the most promising, yet simultaneously hardest, approaches is automatic parallelization, which requires no effort on the part of the programmer to parallelize a program. The polyhedral model for compiler optimization is a powerful mathematical framework based on parametric linear algebra and integer linear programming. It provides an abstraction for representing nested loop computations and their data dependences as integer points in polyhedra. Complex execution reordering, which can improve performance through parallelization as well as locality enhancement, is captured by affine transformations in the polyhedral model. With several recent advances, the polyhedral model has reached a level of maturity in various respects, in particular as a powerful intermediate representation for performing transformations and for code generation after transformations. However, an approach to automatically find good transformations for communication-optimized coarse-grained parallelization together with locality optimization has been a key missing link. This dissertation presents a new automatic transformation framework that solves the above problem. Our approach finds good affine transformations through a powerful and practical linear cost function that enables efficient tiling and fusion of sequences of arbitrarily nested loops, allowing simultaneous optimization for coarse-grained parallelism and locality. Synchronization-free parallelism and pipelined parallelism at various levels can be extracted. The framework can be targeted to different parallel architectures, such as general-purpose multicores, the Cell processor, GPUs, and embedded multiprocessor SoCs. The proposed framework has been implemented in a new end-to-end transformation tool, Pluto, that can automatically generate parallel code from regular C program sections. Experimental results from the implemented system show significant performance improvement for single-core and multicore execution over state-of-the-art research compiler frameworks as well as the best native production compilers. For several dense linear algebra kernels, code generated by Pluto beats, by a significant margin, the same kernels implemented with sequences of calls to highly tuned vendor-supplied libraries. The system also allows empirical optimization to be performed in a much wider context than has been attempted previously. In addition, Pluto can serve as the parallel code generation backend for several high-level domain-specific languages.
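
To make the kind of transformation described in the abstract concrete, the following is a minimal illustrative sketch, not actual Pluto output: a regular affine C loop nest of the sort the tool accepts, followed by a hand-written approximation of the tiled, coarse-grained parallel code such a framework is designed to produce. The tile size of 32, the array size N, and the OpenMP parallelization of the outermost tile loop are assumptions made for illustration only.

    #include <omp.h>

    #define N 1024
    #define T 32                        /* assumed tile size, for illustration */
    #define min(a, b) ((a) < (b) ? (a) : (b))

    double A[N][N], B[N][N], C[N][N];

    /* Input: a regular affine loop nest. Each iteration is an integer point
       (i, j, k) in the polyhedron { 0 <= i < N, 0 <= j < N, 0 <= k < N }. */
    void matmul(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* Hand-written approximation of the transformed code: rectangular tiling
       for cache locality, with the outermost tile loop carrying
       synchronization-free parallelism via OpenMP. */
    void matmul_tiled_parallel(void)
    {
        #pragma omp parallel for
        for (int it = 0; it < N; it += T)
            for (int jt = 0; jt < N; jt += T)
                for (int kt = 0; kt < N; kt += T)
                    for (int i = it; i < min(it + T, N); i++)
                        for (int j = jt; j < min(jt + T, N); j++)
                            for (int k = kt; k < min(kt + T, N); k++)
                                C[i][j] += A[i][k] * B[k][j];
    }

In the actual framework, the legality of tiling and the choice of which tile loops run in parallel are derived from dependence analysis and the affine cost function rather than written by hand; for this particular kernel all three loops are permutable, so rectangular tiling and parallelizing the outer tile loop are valid.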