Latency hiding techniques are increasingly used to reduce the effect of long memory latency in multiprocessors, but their use requires additional network bandwidth. The network organization and its design parameters alone can significantly affect performance. With latency hiding, system performance also depends on how well the interconnection network supports these techniques and on how they interact with the network organization. This paper investigates these issues for prefetching and weak consistency in a 128-processor shared-memory system with a 2-D torus, a multistage, or a single-stage network. It shows the performance impact of network organization and link bandwidth, with and without latency hiding, as well as the effect of caching and of limiting the number of outstanding memory requests. The multistage network is the most robust and performs best under all conditions. The single-stage network comes very close in performance when sufficient channel bandwidth is available. The torus network performs worst when channel bandwidth is high but can outperform the single-stage network when it is low. With prefetching, the relative performance of the three networks remains similar, with the torus gaining the most. Prefetching can decrease benchmark execution time by as much as 25%; further gains depend on reducing the effect of write traffic. Finally, an optimal number of outstanding requests is shown to exist, but its value is program-dependent.