As processor performance continues to increase, greater demands are placed on the bus and memory systems of small-scale shared-memory multiprocessors. In this paper, we investigate how to reduce these demands by organizing groups of processors into clusters that are then connected by a shared global bus. We take advantage of the high-bandwidth, low-latency interconnections available with multichip module (MCM) technology to build clusters of multiple high-performance processors sharing an L2 cache. MCM technology allows significantly lower shared-cache access times, and higher shared-cache-to-processor bandwidth, than is possible with printed circuit board (PCB) designs. Our results show that in an eight-processor bus-based system, bus contention can account for a large portion of overall execution time, and that clustering can eliminate much or all of it. Clustering also tends to reduce read stall times, owing to shared-working-set effects and fewer communication misses. The same holds for two- and four-processor systems, though to a lesser extent. Overall, we find that clustering can yield significant performance gains for applications that make heavy use of the memory system.
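The clustering argument can be illustrated with a back-of-the-envelope bus-utilization model: in a flat design every L1 miss becomes a global-bus transaction, whereas in a clustered design the MCM-shared L2 filters most misses before they reach the bus. This is a minimal sketch; the miss rates, hit rates, and cycle counts below are hypothetical assumptions for illustration, not figures from the paper.

```python
# Toy model of global-bus utilization with and without clustering.
# All numeric parameters are illustrative assumptions, not measured values.

def flat_bus_utilization(n_procs, l1_miss_rate, bus_cycles_per_txn):
    """Flat design: every L1 miss from every processor uses the global bus."""
    return n_procs * l1_miss_rate * bus_cycles_per_txn

def clustered_bus_utilization(n_clusters, procs_per_cluster,
                              l1_miss_rate, shared_l2_hit_rate,
                              bus_cycles_per_txn):
    """Clustered design: L1 misses within a cluster are filtered by the
    MCM-shared L2 cache; only shared-L2 misses become bus transactions."""
    bus_txns_per_cluster = (procs_per_cluster * l1_miss_rate
                            * (1.0 - shared_l2_hit_rate))
    return n_clusters * bus_txns_per_cluster * bus_cycles_per_txn

# Eight processors on one bus vs. two clusters of four sharing an L2.
flat = flat_bus_utilization(8, l1_miss_rate=0.02, bus_cycles_per_txn=10)
clustered = clustered_bus_utilization(2, 4, l1_miss_rate=0.02,
                                      shared_l2_hit_rate=0.6,
                                      bus_cycles_per_txn=10)
print(flat, clustered)  # utilization above 1.0 means the bus is saturated
```

Under these assumed parameters the flat bus is oversubscribed while the clustered bus is not, which mirrors the paper's qualitative finding that clustering removes much of the bus contention seen at eight processors.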