This paper describes a parallel implementation of a matrix/vector library for C++ on a large distributed-memory multicomputer. The library is "self-optimising": it exploits lazy evaluation, delaying the execution of matrix operations for as long as possible. This exposes the context in which each intermediate result is used. The run-time system extracts a functional representation of the values being computed and optimises data distribution, grain size and scheduling prior to execution. This exploits results from the theory of program transformation for optimising parallel functional programs, while presenting an entirely conventional interface to the programmer. We present details of some of the simple optimisations we have implemented so far and illustrate their effect using a small example.

Conventionally, optimisation is confined to compile-time, and compilation is completed before run-time. Many exciting opportunities are lost by this convenient divide; this paper presents one example of what becomes possible when it is crossed. We perform optimisation at run-time for three important reasons:

• We wish to deliver a library which uses parallelism to implement ADTs efficiently, callable from any client program (in any sensible language) without special parallel programming expertise. This means we cannot perform compile-time analysis of the caller's source code.

• We wish to perform optimisations which take advantage of how the client program uses intermediate values. This would be straightforward at compile-time, but not for a library called at run-time.

• We wish to take advantage of information available only at run-time, such as the way operations are composed, and the size and characteristics of intermediate data structures.

We aim to achieve much of the performance of compile-time optimisation, and possibly more by exploiting run-time information, while retaining the ease with which a library can be installed and used.
There is some run-time overhead involved, which limits the scope of the approach.