Efficient Interprocedural Data Placement Optimisation in a Parallel Library

Authors:
Olav Beckmann; Paul H. J. Kelly
Venue:
LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Year:
1998



Abstract

This paper describes a combination of methods that make interprocedural data placement optimisation available to parallel libraries. We propose a delayed-evaluation, self-optimising (DESO) numerical library for a distributed-memory multicomputer. Delayed evaluation allows us to capture the control flow of a user program from within the library at runtime, and to construct an optimised execution plan by propagating data placement constraints backwards through the DAG representing the computation to be performed. Our strategy for optimising data placements at runtime has three parts: an efficient representation for data distributions; a greedy optimisation algorithm which, because of delayed evaluation, can take account of the full context of operations; and re-use of the results of previous runtime optimisations on contexts we have encountered before. We show performance figures for our library on a cluster of Pentium II Linux workstations. These demonstrate that the overhead of our delayed evaluation method is very small, and show both the parallel speedup we obtain and the benefit of the optimisations we describe.
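
The ideas in the abstract (recording library calls instead of executing them, pushing placement constraints backwards through the recorded DAG, and re-using a plan for a context seen before) can be illustrated with a minimal sketch. This is not the authors' library: the names `Node`, `leaf`, `add`, `propagate`, `force` and `plan_cache` are hypothetical, and the string labels "row" and "col" are an assumed stand-in for the paper's representation of data distributions, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (leaf name, current placement, required placement); all names are hypothetical.
Step = Tuple[str, str, str]


@dataclass(frozen=True)
class Node:
    """One vertex of the delayed-evaluation DAG (structurally hashable)."""
    op: str                          # "leaf" or an operation name such as "add"
    args: Tuple["Node", ...] = ()
    name: str = ""                   # set for leaves only
    placement: Optional[str] = None  # known for leaves, None for delayed results


def leaf(name: str, placement: str) -> Node:
    return Node("leaf", (), name, placement)


def add(a: Node, b: Node) -> Node:
    # Delayed evaluation: record the call, compute nothing yet.
    return Node("add", (a, b))


def first_placement(node: Node) -> Optional[str]:
    """Placement of the first leaf found depth-first (a simple greedy default)."""
    if node.op == "leaf":
        return node.placement
    for arg in node.args:
        found = first_placement(arg)
        if found is not None:
            return found
    return None


def propagate(node: Node, wanted: Optional[str], steps: List[Step]) -> None:
    """Push the placement required by the consumer back towards the leaves."""
    if node.op == "leaf":
        if wanted is not None and node.placement != wanted:
            steps.append((node.name, node.placement or "", wanted))
        return
    if wanted is None:
        wanted = first_placement(node)  # no constraint from above: greedy choice
    for arg in node.args:
        propagate(arg, wanted, steps)


# Plans are cached by DAG structure plus the demanded result placement.
plan_cache: Dict[Tuple[Node, Optional[str]], List[Step]] = {}


def force(node: Node, wanted: Optional[str] = None) -> Tuple[List[Step], bool]:
    """Return (redistribution plan, whether the plan came from the cache)."""
    key = (node, wanted)
    if key in plan_cache:
        return plan_cache[key], True
    steps: List[Step] = []
    propagate(node, wanted, steps)
    plan_cache[key] = steps
    return steps, False
```

In this sketch, forcing `add(add(x, A), y)` with the result demanded in "row" placement, where `x` is held by columns and `A` and `y` by rows, yields a single redistribution of `x`; forcing the same expression again returns the cached plan without re-running the propagation. Making `Node` a frozen dataclass is a design choice of the sketch: it makes the DAG itself usable as the cache key.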