While multistage switching networks for vector multiprocessors have been studied extensively, detailed evaluations of their performance are rare. Indeed, analytical models, simulations with pseudo-synthetic loads, studies focused on average-value parameters, and measurements of networks disconnected from the machine all provide limited information. In this paper, instead, we present an in-depth empirical analysis of a multistage switching network in a realistic setting: we use hardware probes to examine the performance of the omega network of the Cedar shared-memory machine executing real applications. The machine is configured with 16 vector processors.

The analysis suggests that the performance of multistage switching networks is limited by traffic non-uniformities. We identify two major non-uniformities that degrade Cedar's performance and are likely to slow down other networks too. The first one is the contention caused by the return messages in a vector access as they converge from the memories to one processor port. This traffic convergence penalizes vector reads and, more importantly, causes tree saturation. The second non-uniformity is the uneven contention delays induced by even a relatively fair scheme to resolve message collisions. Based on our observations, we argue that intuitive optimizations for multistage switching networks may not be cost-effective. Instead, we suggest changes to increase the network bandwidth at the root of the traffic convergence tree and to delay traffic convergence up until the final stages of the network.
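The traffic convergence described above can be illustrated with a small sketch. It assumes a textbook omega network built from 2x2 switches with destination-tag routing and 16 ports. This is a simplification chosen for illustration, not a model of Cedar's actual switch configuration, and the function names `omega_route` and `link_loads` are hypothetical. The sketch counts how many return messages, one from each of 16 memories to a single processor port, share each link after every stage.

```python
from collections import Counter


def omega_route(src, dst, n_bits):
    """Links occupied after each stage when routing src -> dst.

    Assumes a 2x2-switch omega network with destination-tag routing:
    each stage applies a perfect shuffle, then the switch output is
    selected by the next destination bit (most significant bit first).
    """
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(n_bits):
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        # Shuffle (shift left, dropping the top bit), then let the
        # switch set the low bit to the destination bit.
        pos = ((pos << 1) & mask) | dst_bit
        path.append(pos)
    return path


def link_loads(sources, dst, n_bits):
    """Per-stage counters of how many messages use each link."""
    loads = [Counter() for _ in range(n_bits)]
    for src in sources:
        for stage, link in enumerate(omega_route(src, dst, n_bits)):
            loads[stage][link] += 1
    return loads


N_BITS = 4  # 16 ports (illustrative size, matching the 16 processors)
loads = link_loads(range(1 << N_BITS), dst=5, n_bits=N_BITS)
links_used = [len(c) for c in loads]
max_load = [max(c.values()) for c in loads]

for stage, (used, load) in enumerate(zip(links_used, max_load)):
    print(f"stage {stage}: {used} links in use, {load} messages per link")
```

Under these assumptions the number of links carrying the return traffic halves at each stage while the load per link doubles, so the final link into the processor port must carry all 16 messages. This is the shape of the convergence tree whose root the abstract proposes to widen.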