On the use and performance of communication primitives in software controlled cache-coherent cluster architectures
Journal of Systems Architecture: the EUROMICRO Journal
Software cache-coherent systems using programmable protocol processors provide a flexible infrastructure for expanding systems in size and function. However, this flexibility comes at a cost in performance. First, a software implementation of a protocol is inherently slower than a hardware implementation. Second, when multiple processors share a protocol processor, contention may result in a substantial increase in memory latency.

In this paper, we study how the overhead of a software scheme can be reduced in the context of a shared-memory system consisting of SMP clusters. We study various design choices, including hardware assists such as forwarding logic in the protocol processor and software hints through explicit communication primitives. We conduct our experiments via trace-driven simulation and compare the execution of three programs from the SPLASH-2 suite.

We found that small cluster sizes (up to 4 processors/node) work well for both hardware and software implementations. When the forwarding logic is incorporated into the software scheme, its performance is competitive with that of the hardware scheme. When enhanced further by explicit communication primitives, the software scheme can perform even better than a pure hardware implementation. This is particularly noticeable when the network latency is high.
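
The contention effect described in the abstract can be illustrated with a toy queueing model. The sketch below is not the paper's trace-driven simulator; it assumes the shared protocol processor behaves like a single M/M/1 server, and the function name, request rate, and occupancy values are all hypothetical, chosen only to show the trend as more processors share one protocol processor.

```python
def mean_response_time(n_procs, req_rate, occupancy):
    """Mean cycles a request spends at a shared protocol processor.

    Assumption: the protocol processor is modelled as an M/M/1 queue.
    n_procs   -- processors sharing the protocol processor (cluster size)
    req_rate  -- protocol requests per cycle issued by each processor (hypothetical)
    occupancy -- cycles the protocol processor is busy per request (hypothetical)
    """
    utilization = n_procs * req_rate * occupancy
    if utilization >= 1.0:
        # The protocol processor is saturated; the queue grows without bound.
        return float("inf")
    return occupancy / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical occupancies: a software handler is assumed to hold the
    # protocol processor longer per request than a hardware controller.
    for n in (1, 2, 4, 8):
        hw = mean_response_time(n, req_rate=0.002, occupancy=20)
        sw = mean_response_time(n, req_rate=0.002, occupancy=60)
        print(f"{n} procs/node: hardware {hw:.1f} cycles, software {sw:.1f} cycles")
```

Under these assumed numbers, the gap between the two occupancies widens as the cluster grows, which is consistent in spirit with the abstract's observation that small clusters suit both schemes. Actual behaviour depends on the workload and the protocol.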