On Interaction between Interconnection Network Design and Latency Hiding Techniques in Multiprocessors

Authors:
Sunil Kim;Alexander V. Veidenbaum
Affiliations:
Department of Computer Engineering, Hongik University, Mapo-gu Sangsoo-dong 72-1, Seoul 121-791, Korea (sikim@cs.hongik.ac.kr); Information and Computer Science, University of California, Irvine, CA 92697-3425 (alexv@cs.uci.edu)
Venue:
The Journal of Supercomputing
Year:
2000

Abstract

Latency hiding techniques are increasingly used to minimize the effect of long memory latency in multiprocessors, but their use requires additional network bandwidth. The network organization and its design parameters alone can significantly affect performance. With latency hiding, system performance also depends on how well the interconnection network supports such techniques and on how they interact with the network organization. This paper investigates these issues for prefetching and weak consistency in a 128-processor shared-memory system with a 2-D torus, a multistage, or a single-stage network. It shows the performance impact of network organization and link bandwidth, with and without latency hiding techniques, as well as the effect of caching and of limiting the number of outstanding memory requests. The multistage network is the most robust and has the best performance under all conditions. The single-stage network is very close in performance when sufficient channel bandwidth is available. The torus network performs worst when channel bandwidth is high but can exceed single-stage performance when it is low. The relative performance of the three networks with prefetching remains similar, with the torus gaining the most. Benchmark execution time can decrease by as much as 25% with prefetching; further gains depend on reducing the effect of write traffic. Finally, an optimal number of outstanding requests is shown to exist, but its value is program-dependent.