Deadlock-Free Message Routing in Multiprocessor Interconnection Networks
IEEE Transactions on Computers
The Stanford Dash Multiprocessor
Computer
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Evaluating virtual channels for cache-coherent shared-memory multiprocessors
ICS '96 Proceedings of the 10th international conference on Supercomputing
Performance benefits of virtual channels and adaptive routing: an application-driven study
ICS '97 Proceedings of the 11th international conference on Supercomputing
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
IEEE Transactions on Computers - Special issue on cache memory and related problems
Improving the performance of bristled CC-NUMA systems using virtual channels and adaptivity
ICS '99 Proceedings of the 13th international conference on Supercomputing
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Architecture and design of AlphaServer GS320
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
FLASH vs. (Simulated) FLASH: closing the simulation loop
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
A comparative study of arbitration algorithms for the Alpha 21364 pipelined router
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Spider: A High-Speed Network Interconnect
IEEE Micro
The Alpha 21364 Network Architecture
IEEE Micro
IEEE Transactions on Parallel and Distributed Systems
ACM SIGARCH Computer Architecture News
The performance and scalability of distributed shared-memory cache coherence protocols
The performance and scalability of distributed shared-memory cache coherence protocols
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation
IEEE Transactions on Computers
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications
IEEE Transactions on Parallel and Distributed Systems
Recursive partitioning multicast: A bandwidth-efficient routing for Networks-on-Chip
NOCS '09 Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip
The connection-then-credit flow control protocol for heterogeneous multicore systems-on-chip
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems - Special issue on the 2009 ACM/IEEE international symposium on networks-on-chip
Dual partitioning multicasting for high-performance on-chip networks
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Distributed shared memory (DSM) multiprocessors typically require disjoint networks for deadlock-free execution of cache coherence protocols. This is normally achieved by implementing virtual networks with the help of virtual channels or virtual lanes multiplexed on a single physical network. To keep the coherence protocol simple, messages are usually assigned to virtual lanes in a predefined static manner based on a cycle-free lane assignment dependence graph. However, this static split of virtual networks (such as request and reply networks) may lead to underutilization of certain virtual networks while saturating the other networks. In this paper, we explore different static and dynamic schemes to select the virtual lanes for outgoing messages and mix the load among them without restricting any particular type of message to be carried only by a particular virtual network. We achieve this by exposing the selection algorithms to the coherence protocol itself, so that it can inject messages into selected virtual lanes based on some local information, and still enjoy deadlock-freedom. Our execution-driven simulation on five applications from the SPLASH-2 suite shows that as the system scales, the virtual network selection algorithms play an important role. For 128-node systems, our dynamic selection algorithm speeds up parallel execution by as much as 22 percent over an optimized baseline system running a modified SGI Origin 2000 protocol. We also explore how network latency, the number of message buffers per virtual lane, and the depth of network interface output queues affect the relative performance of various virtual lane selection algorithms.