Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
Proceedings of the 1st International Conference on Supercomputing
RP3 performance monitoring hardware
Instrumentation for future parallel computing systems
Beyond loop partitioning: data assignment and overlap to reduce communication overhead
ICS '91 Proceedings of the 5th international conference on Supercomputing
We study the behavior of two numerical algorithms (matrix multiplication and finite difference methods) on the RP3, a multiprocessor with a three-level memory hierarchy. Using versions of these algorithms that differ in data placement (global, local, global and cacheable, local and cacheable) and in data access (blocked or non-blocked), we measure the impact of these parameters on program performance. The measurements are made with a very accurate monitoring system (VPMC) that records instructions, memory requests, cache requests, and cache misses. We also perform a theoretical performance analysis of these programs using a model of computation and communication, and find good agreement between the theoretical and experimental results. In conclusion, we discuss the use of local memory on such a machine and show that it is not worthwhile given the RP3 ratio of communication between local and global memories. We also discuss optimal use of the cache, show that the optimum can be reached only when the cache has certain properties (a private store-in cache with user control of write-back), and show that optimal blocked algorithms must be used to reach it.
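As an illustration of the blocked versus non-blocked data access distinction mentioned in the abstract, the following is a minimal sketch of both forms of matrix multiplication. It is not the authors' code and does not model RP3 data placement or the VPMC monitor; the function names `matmul_naive` and `matmul_blocked` and the `block` parameter are hypothetical choices made for this example.

```python
def matmul_naive(a, b):
    """Non-blocked product: each output element streams a full row of a
    and a full column of b."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=2):
    """Blocked product: work proceeds tile by tile so that a small
    working set (three block x block tiles) is reused before moving on.
    `block` is an assumed tuning parameter, not a value from the paper."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):          # tile row of c
        for jj in range(0, p, block):      # tile column of c
            for kk in range(0, m, block):  # tile along the inner dimension
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_naive(a, b))
    print(matmul_blocked(a, b, block=2))
```

Both functions compute the same product; they differ only in the order in which elements are touched, which is the property that determines how often a cache or local memory must be refilled.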