A compilation method for communication-efficient partitioning of DOALL loops

Authors:
Santosh Pande;Tareq Bali
Affiliations:
-;-
Venue:
Compiler optimizations for scalable parallel systems
Year:
2001

Citing 31
Cited 0

Process decomposition through locality of reference

PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
A report on the Sisal language project

Journal of Parallel and Distributed Computing - Special issue: data-flow processing
Optimization of array accesses by collective loop transformations

ICS '91 Proceedings of the 5th international conference on Supercomputing
Compiling Fortran D for MIMD distributed-memory machines

Communications of the ACM
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Automatic array alignment in data-parallel programs

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Program partitioning for NUMA multiprocessor computer systems

Journal of Parallel and Distributed Computing - Special issue on performance of supercomputers
Communication-free hyperplane partitioning of nested loops

Journal of Parallel and Distributed Computing
Compiling Fortran 90D/HPF for distributed memory MIMD computers

Journal of Parallel and Distributed Computing - Special issue on data parallel algorithms and programming
Data and task alignment in distributed memory architectures

Journal of Parallel and Distributed Computing - Special issue on data parallel algorithms and programming
An approach to communication-efficient data redistribution

ICS '94 Proceedings of the 8th international conference on Supercomputing
Compiling Functional Parallelism on Distributed-Memory Systems

IEEE Parallel & Distributed Technology: Systems & Technology
A Scalable Scheduling Scheme for Functional Parallelism on Distributed Memory Multiprocessor Systems

IEEE Transactions on Parallel and Distributed Systems
Data and workload distribution in a multithreaded architecture

Journal of Parallel and Distributed Computing
Compiling Communication-Efficient Programs for Massively Parallel Machines

IEEE Transactions on Parallel and Distributed Systems
Compiling Global Name-Space Parallel Loops for Distributed Execution

IEEE Transactions on Parallel and Distributed Systems
Compile-Time Techniques for Data Distribution in Distributed Memory Machines

IEEE Transactions on Parallel and Distributed Systems
Automatic Extraction of Functional Parallelism from Ordinary Programs

IEEE Transactions on Parallel and Distributed Systems
Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers

IEEE Transactions on Parallel and Distributed Systems
On the Granularity and Clustering of Directed Acyclic Task Graphs

IEEE Transactions on Parallel and Distributed Systems
Communication-Free Data Allocation Techniques for Parallelizing Compilers on Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Automatic Array Privatization

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Communication-Free Parallelization via Affine Transformations

LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Solving Alignment Using Elementary Linear Algebra

LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
A New Approach to Array Redistribution: Strip Mining Redistribution

PARLE '94 Proceedings of the 6th International PARLE Conference on Parallel Architectures and Languages Europe
Automatic Data Layout Using 0-1 Integer Programming

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Impact of Load Balancing on Unstructured Adaptive Grid Computations for Distributed-Memory Multiprocessors

SPDP '96 Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing (SPDP '96)
Compilation Techniques for Optimizing Communication on Distributed-Memory Systems

ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 02
Optimizing Data Decomposition for Data Parallel Programs

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 02
Communication Optimizations Used in the Paradigm Compiler for Distributed-Memory Multicomputers

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 02
A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 02

Quantified Score

Hi-index	0.00

Abstract

Because sending and receiving data incurs significant communication overhead, loop partitioning approaches on distributed memory systems must guarantee not just computation load balance but combined computation+communication load balance. Previous approaches to loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning for a limited subset of DOALL loops [6]. A large category of DOALL loops, however, inevitably results in communication, and for those loops the tradeoffs between computation and communication must be carefully analyzed in order to balance the combined computation time and communication overhead. In this work, we describe a partitioning approach, based on this motivation, for the general case of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. First, a code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions that reduce a larger degree of communication by trading off a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a large communication volume. A new "larger partition owns" rule is formulated to minimize the communication overhead of a compute-intensive partition by localizing its references relatively more than those of a smaller, non-compute-intensive partition. A Partition Interaction Graph is then constructed and used to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping onto the actual number of available processors. Relevant theory and algorithms are developed, along with a performance evaluation on the Cray T3D.
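
The merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the merge criterion, so the greedy rule used here (always merge the pair of partitions joined by the heaviest communication edge) is an assumption, as are the function name `merge_partitions` and the dictionary-based graph representation.

```python
def merge_partitions(compute, comm, num_procs):
    """Greedily merge partitions until only num_procs groups remain.

    compute: dict mapping partition id -> computation cost (hypothetical units)
    comm:    dict mapping frozenset({a, b}) -> communication cost between a and b
    Returns a sorted list of (member partition ids, total computation cost).

    Assumed heuristic (not taken from the paper): merge the pair with the
    heaviest communication edge first, so that traffic becomes local.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    groups = {p: {p} for p in compute}
    load = dict(compute)
    edges = dict(comm)
    while len(groups) > num_procs:
        if edges:
            # Heaviest edge wins; ties go to the pair with the smaller combined load.
            pair = max(edges, key=lambda e: (edges[e], -sum(load[p] for p in e)))
            a, b = sorted(pair)
        else:
            # No communication left: merge the two lightest groups.
            a, b = sorted(load, key=load.get)[:2]
        groups[a] |= groups.pop(b)
        load[a] += load.pop(b)
        # Redirect edges that touched b to a; the a-b edge becomes internal.
        merged_edges = {}
        for e, w in edges.items():
            if e == frozenset((a, b)):
                continue
            e2 = frozenset(a if p == b else p for p in e)
            merged_edges[e2] = merged_edges.get(e2, 0) + w
        edges = merged_edges
    return sorted((sorted(g), load[p]) for p, g in groups.items())


# Four equally loaded partitions mapped onto two processors.
example_compute = {0: 3, 1: 3, 2: 3, 3: 3}
example_comm = {frozenset({0, 1}): 5, frozenset({2, 3}): 4, frozenset({1, 2}): 1}
example_result = merge_partitions(example_compute, example_comm, 2)
```

A rule driven only by communication weight can leave the computation load unbalanced on other inputs; a cost function that combines both terms, as the abstract's computation+communication goal suggests, would be the natural refinement.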