Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors

  • Authors:
  • Mark Holliday;Michael Stumm

  • Affiliations:
  • -;-

  • Venue:
  • Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors
  • Year:
  • 1992

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper investigates the performance of word-packet, slotted unidirectional ring-based hierarchical direct networks in the context of large-scale shared memory multiprocessors. Slotted unidirectional rings are attractive because their electrical characteristics and simple interfaces allow for fast cycle times and large bandwidths. For large-scale systems, it is necessary to use multiple rings for increased aggregate bandwidth. Hierarchies are attractive because the topology ensures unique paths between nodes, simple node interfaces and simple inter-ring connections. To ensure that a realistic region of the design space is examined, the architecture of the network used in the Hector prototype is adopted as the initial design point. A simulator of that architecture has been developed and validated with measurements from the prototype. The system and workload parameterization reflects conditions expected in the near future. The results of our study show the importance of system balance on performance. Large-scale systems inherently have large communication delays for distant accesses, so processor efficiency will be low, unless the processors can operate with multiple outstanding transactions using techniques such as prefetching, asynchronous writes and multiple hardware contexts. However with multiple outstanding transactions and only one memory bank per processing module, memory quickly saturates. Memory saturation can be alleviated by having multiple memory banks per processing module, but this shifts the bottleneck to the ring subsystem. While the topology of the ring hierarchy affects performance --- we show that topologies with a similar branching factor at all levels, except, possibly, for slightly smaller rings at the root of the hierarchy tend to perform best --- the ring subsystem will inherently limit the throughput of the system. Hence increasing the number of outstanding transactions per processor beyond a certain point only has a limiting effect on performance, since it causes some of the rings to become congested. An adaptive maximum number of outstanding transactions appears necessary to adjust for the appropriate tradeoff between concurrency and contention as the communication locality changes. We show the relationships between processor, ring and memory speeds, and their effects on performance. Using block transfers for prefetching seems unlikely to be advantageous in that the improvement in the cache hit ratio needed to compensate for the increased network utilization is substantial.