An Application-Driven Study of Multicast Communication for Write Invalidation

Authors:
Hung-Chang Hsiao; Chung-Ta King
Affiliations:
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, R.O.C.; Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, R.O.C. (king@cs.nthu.edu.tw)
Venue:
The Journal of Supercomputing
Year:
2001

Abstract

In distributed shared-memory (DSM) multiprocessors, a write operation requires multiple messages to invalidate the copies at the nodes that share and cache the memory block being written. The resulting write stall time degrades the performance of such systems. An effective way to make invalidation efficient is to use multicast messages to reach the sharing nodes. This study evaluates two multicast-based invalidation schemes, dual-path and pruning, through application-driven simulation. Under the experimental settings used here, multicasts improve invalidation traffic for four of the six real applications evaluated. The remaining two applications are computationally intensive, and multicast-based invalidation is less effective for them. However, since multicasts encourage bursty communication, our results indicate that they help relieve network congestion during the bursty periods. Dual-path performs slightly better than pruning because it is less sensitive to routing delay in the routers. Our results further demonstrate that cache size is an important design parameter: multicast-based invalidation is highly effective for DSM multiprocessors with larger caches.
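
As a rough illustration of why multicast can reduce the number of invalidation messages a home node must inject, the sketch below compares per-sharer unicast invalidation with a dual-path style multicast. It is not the authors' simulator. It assumes a 2D mesh and the Hamiltonian-path (snake-order) labeling commonly used to describe dual-path multicast in the routing literature; the abstract does not state the topology or routing details. All function names and the example mesh are hypothetical.

```python
def hamiltonian_label(x, y, width):
    """Snake-order label of node (x, y) in a 2D mesh with `width` columns.

    Assumption: even rows are numbered left to right, odd rows right to left.
    """
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def dual_path_groups(home, sharers, width):
    """Split the sharers into the two destination lists of a dual-path multicast.

    `high` is visited in increasing label order, `low` in decreasing order,
    so each list can be served by one message travelling along the path.
    """
    home_label = hamiltonian_label(*home, width)
    labeled = [
        (hamiltonian_label(x, y, width), (x, y))
        for (x, y) in sharers
        if (x, y) != home
    ]
    high = [node for label, node in sorted(labeled) if label > home_label]
    low = [node for label, node in sorted(labeled, reverse=True) if label < home_label]
    return high, low


def injected_messages(home, sharers, width):
    """Return (unicast_count, multicast_count) of messages the home node injects."""
    high, low = dual_path_groups(home, sharers, width)
    unicast = len(high) + len(low)                        # one message per sharer
    multicast = (1 if high else 0) + (1 if low else 0)    # at most two messages
    return unicast, multicast


if __name__ == "__main__":
    # Hypothetical 4x4 mesh: home node at (1, 1) and five sharing nodes.
    home = (1, 1)
    sharers = [(0, 0), (3, 0), (2, 2), (0, 3), (3, 1)]
    print(dual_path_groups(home, sharers, width=4))
    print(injected_messages(home, sharers, width=4))
```

In this toy case the home node injects two multicast messages instead of five unicasts. The sketch counts injected messages only; it ignores acknowledgements, path length, and router delay, which the abstract identifies as factors in the comparison between dual-path and pruning.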