Automatic parallel code generation for tiled nested loops

Authors:
Georgios Goumas;Nikolaos Drosinos;Maria Athanasaki;Nectarios Koziris
Affiliations:
National Technical University of Athens;National Technical University of Athens;National Technical University of Athens;National Technical University of Athens
Venue:
Proceedings of the 2004 ACM symposium on Applied computing
Year:
2004

Abstract

This paper presents an overview of our work on a complete end-to-end framework that automatically generates message-passing parallel code for tiled nested for-loops. The framework handles general parallelepiped tiling transformations and general convex iteration spaces. We address the problems involved in both generating sequential tiled code and parallelizing it. We have implemented our techniques in a tool that automatically generates MPI parallel code, and we conducted several series of experiments measuring the compilation time of the tool, the efficiency of the generated code, and the speedup attained on a cluster of PCs. Apart from confirming the value of our techniques, the experimental results show the merit of general parallelepiped tiling transformations and verify previous theoretical work on scheduling-optimal tile shapes.
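
To make "tiled nested loops" concrete, the following is a minimal sketch, not the paper's tool or algorithm. It assumes a rectangular 2D iteration space and rectangular tiles, which are the simplest special case of the parallelepiped tiles the abstract mentions. The function names and the tile sizes `ti` and `tj` are hypothetical, chosen only for illustration. The sketch ignores data dependences and message passing entirely; it only shows that tiling reorders the same iteration points into blocks.

```python
def original_order(n, m):
    """Iteration points of a plain 2D loop nest, in lexicographic order."""
    return [(i, j) for i in range(n) for j in range(m)]


def tiled_order(n, m, ti, tj):
    """The same iteration points, visited tile by tile.

    Assumption: rectangular tiles of size ti x tj. Boundary tiles are
    clipped with min() so that no point outside the space is visited.
    """
    points = []
    for ii in range(0, n, ti):                      # outer loops: tile origins
        for jj in range(0, m, tj):
            for i in range(ii, min(ii + ti, n)):    # inner loops: one tile
                for j in range(jj, min(jj + tj, m)):
                    points.append((i, j))
    return points


if __name__ == "__main__":
    # Every point is visited exactly once, only the order changes.
    assert sorted(tiled_order(5, 7, 2, 3)) == original_order(5, 7)
    print(tiled_order(4, 4, 2, 2)[:4])
```

In a parallel setting, each tile (or each column of tiles) would be a natural unit of work to assign to a process, with communication needed only across tile boundaries. How tiles are shaped, scheduled, and mapped to processes is exactly what the framework described above automates, and is not modeled here.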