A practical automatic polyhedral parallelizer and locality optimizer

Authors:
Uday Bondhugula; Albert Hartono; J. Ramanujam; P. Sadayappan
Affiliations:
The Ohio State University, Columbus, OH, USA; The Ohio State University, Columbus, OH, USA; Louisiana State University, Baton Rouge, LA, USA; The Ohio State University, Columbus, OH, USA
Venue:
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Year:
2008

Abstract

We present the design and implementation of an automatic polyhedral source-to-source transformation framework that optimizes regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show that analytical model-driven automatic transformation in the polyhedral model is practical, and that it goes far beyond what current production compilers can do. Unlike previous work, our approach is end-to-end and fully automatic. It is driven by an integer linear optimization framework that explicitly models how to tile for parallelism and locality using affine transformations. The framework has been implemented in a tool that automatically generates OpenMP parallel code from C program sections. Experimental results from the tool show very high speedups for both sequential and parallel execution on multi-cores. These speedups hold over state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also makes it easy to apply powerful empirical/iterative optimization to general, arbitrarily nested loop sequences.
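To make the idea of a tiling hyperplane concrete, here is a minimal sketch under stated assumptions. It is not the framework's algorithm. The approach described above solves an integer linear program over affine transformation coefficients. This sketch instead brute-forces small non-negative coefficients for a single hypothetical statement with three constant dependence distance vectors. These are the distances of a 1-D Jacobi-style stencil over (t, i). A hyperplane phi = (a, b) is treated as legal for tiling when phi . d >= 0 for every dependence d. Its cost is the largest phi . d, a rough proxy for reuse distance and inter-tile communication. The names `deps`, `hyperplane_cost` and `find_hyperplane` are illustrative only.

```c
#include <assert.h>

/* Hypothetical example: dependence distance vectors (dt, di) of a 1-D
   Jacobi-style stencil  A[t+1][i] = f(A[t][i-1], A[t][i], A[t][i+1]). */
#define NDEPS 3
static const int deps[NDEPS][2] = { {1, 1}, {1, 0}, {1, -1} };

/* A hyperplane phi = (a, b) maps iteration (t, i) to a*t + b*i.
   It is treated as a legal tiling hyperplane if phi . d >= 0 for every
   dependence d. Returns max over d of phi . d, or -1 if phi is illegal. */
static int hyperplane_cost(int a, int b) {
    int cost = 0;
    for (int k = 0; k < NDEPS; k++) {
        int dot = a * deps[k][0] + b * deps[k][1];
        if (dot < 0)
            return -1;              /* a dependence would point backwards */
        if (dot > cost)
            cost = dot;
    }
    return cost;
}

/* Brute-force stand-in for the ILP: finds the cheapest legal hyperplane
   with coefficients in [0, max_coeff] that is linearly independent of the
   previously chosen one (pa, pb); pass (0, 0) when choosing the first.
   Ties keep the first candidate found. Returns the cost, or -1 if none. */
static int find_hyperplane(int max_coeff, int pa, int pb,
                           int *out_a, int *out_b) {
    assert(max_coeff >= 0);
    int best = -1;
    for (int a = 0; a <= max_coeff; a++) {
        for (int b = 0; b <= max_coeff; b++) {
            if (a == 0 && b == 0)
                continue;                        /* trivial zero solution */
            if ((pa != 0 || pb != 0) && a * pb - b * pa == 0)
                continue;                        /* dependent on previous */
            int cost = hyperplane_cost(a, b);
            if (cost >= 0 && (best < 0 || cost < best)) {
                best = cost;
                *out_a = a;
                *out_b = b;
            }
        }
    }
    return best;
}
```

Consider `find_hyperplane(2, 0, 0, &a, &b)`. This call selects (1, 0) with cost 1. Passing (1, 0) back as the previous hyperplane then selects the skewed hyperplane (1, 1) with cost 2. The unskewed candidate (0, 1) is rejected because (0, 1) . (1, -1) = -1. This is why plain rectangular tiling of (t, i) is illegal for this stencil and a skewing transformation is needed first.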