NUMA architectures are now widely accepted, and exploiting data locality is a key issue on such multiprocessors. In this work, we present a method for automatically selecting the iteration/data distributions for a sequential Fortran 77 code while minimizing the parallel execution overhead (communication and load imbalance). We formulate an integer programming problem whose solution gives that minimum parallel overhead. The constraints of the integer programming problem are derived directly from a graph known as the Locality-Communication Graph (LCG), which captures both the memory locality and the communication patterns of a parallel program. In addition, once the iteration/data distributions have been selected, our approach uses the LCG to automatically schedule the communication operations required during program execution. Our approach also handles the aggregation of messages into blocks. The TFFT2 code from the NASA benchmarks, which includes non-affine access functions, non-affine index bounds, and repeated subroutine calls inside loops, has been handled correctly by our approach. With the iteration/data distributions derived by our method, this code achieves parallel efficiencies of over 69% for 16 processors on a Cray T3E, an excellent performance for a complex real code.
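To make the optimization idea concrete, the following is a minimal sketch of the kind of selection problem described above: each loop nest picks one distribution, and the objective sums load imbalance and the communication incurred when neighboring loop nests disagree. It is not the formulation from the paper. The phase names `F1` to `F3`, the `BLOCK`/`CYCLIC` options, and every cost value are invented for illustration, and an exhaustive search stands in for an integer programming solver.

```python
from itertools import product

# Hypothetical toy instance: three loop nests (phases) that access one array.
# Each phase must pick exactly one distribution (a 0-1 choice per option).
CHOICES = ("BLOCK", "CYCLIC")

# Assumed load-imbalance cost of each phase under each distribution.
imbalance = {
    "F1": {"BLOCK": 0, "CYCLIC": 4},
    "F2": {"BLOCK": 6, "CYCLIC": 1},
    "F3": {"BLOCK": 0, "CYCLIC": 3},
}

# Assumed communication cost on each edge between consecutive phases,
# paid only when the two phases choose different distributions.
redistribution = {("F1", "F2"): 2, ("F2", "F3"): 2}


def overhead(assignment):
    """Total overhead: load imbalance plus communication between phases."""
    load = sum(imbalance[phase][dist] for phase, dist in assignment.items())
    comm = sum(cost for (a, b), cost in redistribution.items()
               if assignment[a] != assignment[b])
    return load + comm


def best_assignment():
    """Enumerate every assignment; practical only for tiny instances."""
    phases = sorted(imbalance)
    candidates = (dict(zip(phases, combo))
                  for combo in product(CHOICES, repeat=len(phases)))
    return min(candidates, key=overhead)


if __name__ == "__main__":
    choice = best_assignment()
    print(choice, overhead(choice))
```

In this toy instance the cheapest plan switches the middle phase to a different distribution, because the imbalance it avoids outweighs the two redistributions it causes. A real instance would replace the enumeration with an integer programming solver and take its costs and constraints from the program's access patterns.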