As processor performance continues to increase, greater demands are placed on the bus and memory systems of small-scale shared-memory multiprocessors. In this paper, we investigate how to reduce these demands by organizing groups of processors into clusters that are then connected by a shared global bus. We take advantage of the high-bandwidth, low-latency interconnections available with multichip module (MCM) technology to build clusters of multiple high-performance processors sharing an L2 cache. MCM technology allows significantly lower shared-cache access times, and higher shared-cache-to-processor bandwidth, than is possible with printed circuit board (PCB) designs. Our results show that in an eight-processor bus-based system, bus contention can account for a large portion of overall execution time, and that clustering can eliminate much or all of it. Clustering also tends to reduce read stall times, owing to shared-working-set effects and fewer communication misses. The same holds for two- and four-processor systems, though to a lesser extent. Overall, we find that clustering can yield significant performance gains for applications that make heavy use of the memory system.
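The clustering argument can be illustrated with a back-of-the-envelope bus-utilization model: in a flat design every L1 miss becomes a global-bus transaction, whereas in a clustered design the MCM-shared L2 filters most misses before they reach the bus. This is a minimal sketch; the miss rates, hit rates, and cycle counts below are hypothetical assumptions for illustration, not figures from the paper.

```python
# Toy model of global-bus utilization with and without clustering.
# All numeric parameters are illustrative assumptions, not measured values.

def flat_bus_utilization(n_procs, l1_miss_rate, bus_cycles_per_txn):
    """Flat design: every L1 miss from every processor uses the global bus."""
    return n_procs * l1_miss_rate * bus_cycles_per_txn

def clustered_bus_utilization(n_clusters, procs_per_cluster,
                              l1_miss_rate, shared_l2_hit_rate,
                              bus_cycles_per_txn):
    """Clustered design: L1 misses within a cluster are filtered by the
    MCM-shared L2 cache; only shared-L2 misses become bus transactions."""
    bus_txns_per_cluster = (procs_per_cluster * l1_miss_rate
                            * (1.0 - shared_l2_hit_rate))
    return n_clusters * bus_txns_per_cluster * bus_cycles_per_txn

# Eight processors on one bus vs. two clusters of four sharing an L2.
flat = flat_bus_utilization(8, l1_miss_rate=0.02, bus_cycles_per_txn=10)
clustered = clustered_bus_utilization(2, 4, l1_miss_rate=0.02,
                                      shared_l2_hit_rate=0.6,
                                      bus_cycles_per_txn=10)
print(flat, clustered)  # utilization above 1.0 means the bus is saturated
```

Under these assumed parameters the flat bus is oversubscribed while the clustered bus is not, which mirrors the paper's qualitative finding that clustering removes much of the bus contention seen at eight processors.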