ICS '93 Proceedings of the 7th international conference on Supercomputing
We have developed an automatic technique for evaluating the communication performance of massively parallel processors (MPPs). Both communication latency and the amount of communication are investigated as a function of a few basic parameters that characterize an application workload. Parameter values are captured in an automatically generated sparse matrix that multiplies a dense vector in the synthetic workload. Our approach can explain the degradation of processor performance caused by communication.

Using the Kendall Square Research KSR1 MPP as a case study, we demonstrate the effectiveness of the technique through a series of experiments that characterize its communication performance. We show that read and write communication latencies vary from 150 to 180 and from 80 to 100 processor cycles, respectively. We also show that read communication latency approximates a linear function of the total system communication (in subpages), that write communication latency approximates a linear function of the number of distinct shared subpages, and that KSR's automatic update feature is effective in reducing the number of read communications, given careful binding of threads to processors.
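As a rough illustration of the idea of a synthetic sparse-matrix workload, the following sketch generates a sparse matrix pattern, multiplies it by a dense vector, and counts how many distinct remote subpages of the vector each processor would have to read. It is not the authors' generator: the block distribution of rows and vector elements, the function names, and the `subpage_elems` value (elements per subpage) are assumptions made here for illustration only.

```python
import random


def make_sparse_rows(n, nnz_per_row, seed=0):
    """Hypothetical generator: each row is a sorted list of nonzero column indices."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n), nnz_per_row)) for _ in range(n)]


def owner(index, n, procs):
    """Assumed block distribution: map a row or vector index to its processor."""
    block = (n + procs - 1) // procs  # ceiling division
    return index // block


def count_remote_subpages(rows, procs, subpage_elems=16):
    """Count distinct remote vector subpages read by each processor.

    subpage_elems is an illustrative parameter, not a value from the abstract.
    """
    n = len(rows)
    remote = [set() for _ in range(procs)]
    for i, cols in enumerate(rows):
        p = owner(i, n, procs)
        for j in cols:
            if owner(j, n, procs) != p:
                remote[p].add(j // subpage_elems)
    return [len(s) for s in remote]


def spmv(rows, x):
    """y = A x, with every stored nonzero of A taken to be 1.0."""
    return [sum(x[j] for j in cols) for cols in rows]


if __name__ == "__main__":
    rows = make_sparse_rows(n=1024, nnz_per_row=8, seed=1)
    x = [1.0] * 1024
    y = spmv(rows, x)
    print("remote subpages per processor:", count_remote_subpages(rows, procs=4))
```

Varying `nnz_per_row`, the column pattern, or `procs` changes the amount of communication, which is the kind of workload parameter the abstract describes; measured latencies could then be compared against these counts.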