An Automatic Iteration/Data Distribution Method Based on Access Descriptors for DSMM

Authors:
Angeles G. Navarro;Emilio L. Zapata
Affiliations:
-;-
Venue:
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Year:
1999

Citing 7

Global optimizations for parallelism and locality on scalable parallel machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation

Simplification of array access patterns for compiler optimizations
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation

Dynamic data distribution with control flow analysis
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing

Parallel Programming with Polaris
Computer

Solving Alignment Using Elementary Linear Algebra
LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing

Automatic Data Layout Using 0-1 Integer Programming
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques

Access Descriptor based Locality Analysis for Distributed-Shared Memory Multiprocessors
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing

Cited 1

Simulation as a tool for optimizing memory accesses on NUMA machines
Performance Evaluation - Performance modelling and evaluation of high-performance parallel and distributed systems

Abstract

Nowadays NUMA architectures are widely accepted. For such multiprocessors, exploiting data locality is clearly a key issue. In this work, we present a method for automatically selecting the iteration/data distributions for a sequential F77 code while minimizing the parallel execution overhead (communications and load imbalance). We formulate an integer programming problem to achieve that minimum parallel overhead. The constraints of the integer programming problem are derived directly from a graph known as the Locality-Communication Graph (LCG), which captures the memory locality, as well as the communication patterns, of a parallel program. In addition, our approach uses the LCG to automatically schedule the communication operations required during program execution, once the iteration/data distributions have been selected. Our approach also handles the aggregation of messages into blocks. The TFFT2 code, from the NASA benchmarks, which includes non-affine access functions, non-affine index bounds, and repeated subroutine calls inside loops, has been correctly handled by our approach. With the iteration/data distributions derived from our method, this code achieves parallel efficiencies of over 69% for 16 processors on a Cray T3E, an excellent performance for a complex real code.
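
As a rough illustration of the kind of selection problem the abstract describes, the sketch below picks one distribution per loop nest so that the sum of load imbalance and communication cost is minimal. It is not the authors' LCG-based formulation: the three phases, the BLOCK/CYCLIC choices, and every cost value are hypothetical, and exhaustive search stands in for an integer programming solver.

```python
from itertools import product

# Hypothetical distribution choices for each loop nest (phase).
CHOICES = ("BLOCK", "CYCLIC")

# Assumed load-imbalance cost of each phase under each distribution.
# These numbers are invented for illustration only.
IMBALANCE = [
    {"BLOCK": 0, "CYCLIC": 4},
    {"BLOCK": 5, "CYCLIC": 1},
    {"BLOCK": 0, "CYCLIC": 3},
]

# Assumed communication cost paid whenever two consecutive phases
# use different distributions and the data must be redistributed.
REDISTRIBUTION_COST = 1


def overhead(assignment):
    """Total parallel overhead: load imbalance plus redistribution cost."""
    load = sum(IMBALANCE[i][dist] for i, dist in enumerate(assignment))
    comm = sum(
        REDISTRIBUTION_COST
        for prev, nxt in zip(assignment, assignment[1:])
        if prev != nxt
    )
    return load + comm


def best_assignment():
    """Enumerate every assignment and return the one with least overhead."""
    return min(product(CHOICES, repeat=len(IMBALANCE)), key=overhead)


if __name__ == "__main__":
    choice = best_assignment()
    print(choice, overhead(choice))
```

With these made-up costs the cheapest plan switches distributions twice, because the redistribution cost is small compared with the imbalance it avoids. Enumeration is only practical for tiny instances, which is one reason a real formulation would be handed to an integer programming solver.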