Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Authors:
Muthu Manikandan Baskaran;Nagavijayalakshmi Vydyanathan;Uday Kumar Reddy Bondhugula;J. Ramanujam;Atanas Rountev;P. Sadayappan
Affiliations:
The Ohio state University, Columbus, OH, USA;The Ohio state University, Columbus, OH, USA;The Ohio state University, Columbus, OH, USA;Louisiana State University, Baton Rouge, LA, USA;The Ohio state University, Columbus, OH, USA;The Ohio state University, Columbus, OH, USA
Venue:
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2009

Citing 34
Cited 16

Compile-time partitioning and scheduling of parallel programs

SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
Automatic translation of FORTRAN programs to vector form

ACM Transactions on Programming Languages and Systems (TOPLAS)
Run-time scheduling and execution of loops on message passing machines

Journal of Parallel and Distributed Computing - Special issue: algorithms for hypercube computers
Run-Time Parallelization and Scheduling of Loops

IEEE Transactions on Computers
Scanning polyhedra with DO loops

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallel processing of near fine grain tasks using static scheduling OSCAR (optimally scheduled advanced multiprocessor)

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A practical algorithm for exact array dependence analysis

Communications of the ACM
Improving the performance of runtime parallelization

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Runtime compilation techniques for data partitioning and communication schedule reuse

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Some efficient solutions to the affine scheduling problem: I. One-dimensional time

International Journal of Parallel Programming
The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Optimal fine and medium grain parallelism detection in polyhedral reduced dependence graphs

International Journal of Parallel Programming
Loop parallelization algorithms: from parallelism extraction to code generation

Parallel Computing - Special issues on languages and compilers for parallel computers
Maximizing parallelism and minimizing synchronization with affine partitions

Parallel Computing - Special issues on languages and compilers for parallel computers
An affine partitioning algorithm to maximize parallelism and minimize communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
Static scheduling algorithms for allocating directed task graphs to multiprocessors

ACM Computing Surveys (CSUR)
Generation of Efficient Nested Loops from Polyhedra

International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, part 2
A comparison of list schedules for parallel processing systems

Communications of the ACM
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Partitioning and Scheduling Parallel Programs for Multiprocessors

Partitioning and Scheduling Parallel Programs for Multiprocessors
An efficient algorithm for the run-time parallelization of DOACROSS loops

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
On the Granularity and Clustering of Directed Acyclic Task Graphs

IEEE Transactions on Parallel and Distributed Systems
Automatic Parallelization in the Polytope Model

The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications
Toward efficient and robust software speculative parallelization on multiprocessors

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Improving parallelism and data locality with affine partitioning

Improving parallelism and data locality with affine partitioning
Code Generation in the Polyhedral Model Is Easier Than You Think

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies

International Journal of Parallel Programming
Violated dependence analysis

Proceedings of the 20th annual international conference on Supercomputing
Sensitivity analysis for automatic parallelization on multi-cores

Proceedings of the 21st annual international conference on Supercomputing
A practical automatic polyhedral parallelizer and locality optimizer

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
Polyhedral code generation in the real world

CC'06 Proceedings of the 15th international conference on Compiler Construction

A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Multithreaded Geant4: semi-automatic transformation into scalable thread-parallel software

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
A hybrid approach of OpenMP for clusters

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Automatically tuning parallel and parallelized programs

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Programming with intervals

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
On-chip cache hierarchy-aware tile scheduling for multicore machines

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Parcae: a system for flexible parallel execution

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Automatic OpenMP loop scheduling: a combined compiler and runtime approach

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Tiling stencil computations to maximize parallelism

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
From serial loops to parallel execution on distributed systems

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Sigma*: symbolic learning of input-output specifications

POPL '13 Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Accurate prediction of the behavior of multithreaded applications in shared caches

Parallel Computing
Proof-Directed Parallelization Synthesis by Separation Logic

ACM Transactions on Programming Languages and Systems (TOPLAS)
Semi-automatic restructuring of offloadable tasks for many-core accelerators

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Compiling affine loop nests for distributed-memory parallel architectures

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Self adaptive run time scheduling for the automatic parallelization of loops with the C2µTC/SL compiler

Parallel Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Recent advances in polyhedral compilation technology have made it feasible to automatically transform affine sequential loop nests for tiled parallel execution on multi-core processors. However, for multi-statement input programs with statements of different dimensionalities, such as Cholesky or LU decomposition, the parallel tiled code generated by existing automatic parallelization approaches may suffer from significant load imbalance, resulting in poor scalability on multi-core systems. In this paper, we develop a completely automatic parallelization approach for transforming input affine sequential codes into efficient parallel codes that can be executed on a multi-core system in a load-balanced manner. In our approach, we employ a compile-time technique that enables dynamic extraction of inter-tile dependences at run-time, and dynamic scheduling of the parallel tiles on the processor cores for improved scalable execution. Our approach obviates the need for programmer intervention and re-writing of existing algorithms for efficient parallel execution on multi-cores. We demonstrate the usefulness of our approach through comparisons using linear algebra computations: LU and Cholesky decomposition.