Designing algorithms on hierarchical memory multiprocessors

Authors:
Luigi Brochard;Alex Freau
Affiliations:
IBM Research, T. J. Watson Research Center, Yorktown Heights, NY;IBM Research, T. J. Watson Research Center, Yorktown Heights, NY
Venue:
ICS '90 Proceedings of the 4th international conference on Supercomputing
Year:
1990

Abstract

We study the behavior of two numerical algorithms (matrix multiplication and finite difference methods) on the RP3, a multiprocessor with a three-level memory hierarchy. We compare versions of these algorithms that differ in data placement (global, local, global and cacheable, local and cacheable) and in data access (blocked or non-blocked), and we study the impact of these parameters on program performance. The performance analysis uses a very accurate monitoring system (VPMC), which records instructions, memory requests, cache requests, and cache misses. We also carry out a theoretical performance analysis of these programs using a model of computation and communication. Good agreement is found between theoretical and experimental results. In conclusion, we discuss the use of local memory on such a machine and show that it is not worthwhile given the RP3 ratio of communication costs between local and global memories. We also discuss optimal use of the cache, show that the optimum can only be reached under certain cache properties (a private store-in cache with user control of write-back), and show that optimal blocked algorithms must be used to reach it.
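To illustrate the distinction between blocked and non-blocked data access mentioned in the abstract, here is a minimal sketch in Python. It is not the authors' code and does not model the RP3 memory hierarchy or the VPMC monitor. The function names and the `block` parameter are assumptions introduced for illustration only. The blocked version walks the matrices tile by tile so that each tile can, in principle, be reused while it is resident in a fast memory level.

```python
def matmul_naive(a, b):
    """Non-blocked product: sweeps a full row of a and a full column of b per output element."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=2):
    """Blocked product: same arithmetic, reordered over block x block tiles.

    `block` is a hypothetical tuning parameter; on a real machine it would be
    chosen so that the tiles in use fit together in the cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, p, block):
            for kk in range(0, m, block):
                # Accumulate the contribution of one tile of a and one tile of b.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, p)):
                        s = c[i][j]
                        for k in range(kk, min(kk + block, m)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_naive(a, b))
print(matmul_blocked(a, b, block=2))
```

Both functions perform the same multiplications and additions. Only the order of memory accesses differs, which is the parameter the abstract identifies as decisive for cache use.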