This paper is an exploration of the parallel graph reduction approach to parallel functional programming, illustrated by a particular example: a pipelined, dynamically-scheduled implementation of searches, updates and read-modify-write transactions on an in-store binary search tree. We use program transformation, execution-driven simulation and analytical modelling to expose the maximum potential parallelism and the minimum communication and synchronisation overheads, and to control the overall space requirement. We begin with a lazy functional program specifying a series of transactions on a binary tree, each involving several searches and updates, in a side-effect-free fashion. Transformation of the source code produces a formulation of the program with greater locality and larger grain size than can be achieved using naive parallelisation methods, and we show that, with care, these tasks can be scheduled effectively. Even with a workload using random keys, significant spatial locality is found, and we evaluate a modified cache coherence protocol which avoids false sharing, so that large cache lines can be used to minimise the number of messages required. As expected with a pipeline, the application should reach a steady state as soon as the first transaction is completed. However, if the network latency is too large, the rate of completion lags behind the rate at which work is admitted, and internal queues grow without bound. We determine the conditions under which this occurs, and show how it can be avoided while maximising speedup.
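The side-effect-free formulation described above can be illustrated with a small sketch. This is not the paper's code (the paper works from a lazy functional program): it is a hypothetical Python rendering of a persistent binary search tree, in which every update returns a new tree that shares all unchanged subtrees with the old one. It is this absence of in-place mutation that makes a series of read-modify-write transactions safe to pipeline.

```python
# Illustrative sketch only: a persistent (side-effect-free) binary search
# tree. An update rebuilds just the O(log n) search path and shares every
# other subtree with the previous version of the tree.

class Node:
    __slots__ = ("key", "val", "left", "right")

    def __init__(self, key, val, left=None, right=None):
        self.key, self.val, self.left, self.right = key, val, left, right

def search(t, k):
    """Return the value stored under key k, or None if absent."""
    while t is not None:
        if k == t.key:
            return t.val
        t = t.left if k < t.key else t.right
    return None

def insert(t, k, v):
    """Persistent update: returns a new tree; the old tree is untouched."""
    if t is None:
        return Node(k, v)
    if k == t.key:
        return Node(k, v, t.left, t.right)
    if k < t.key:
        return Node(t.key, t.val, insert(t.left, k, v), t.right)
    return Node(t.key, t.val, t.left, insert(t.right, k, v))

def transaction(t, k, f, default=0):
    """Read-modify-write as one pure step: search, apply f, write back."""
    old = search(t, k)
    return insert(t, k, f(old if old is not None else default))
```

Because each transaction copies only its search path, earlier versions of the tree remain valid while later transactions begin, which is the sharing property a pipelined, dynamically-scheduled implementation can exploit.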