Comparative Modeling and Evaluation of CC-NUMA and COMA on Hierarchical Ring Architectures

Authors:
Xiaodong Zhang; Yong Yan
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1995

Abstract

Parallel computing performance on scalable shared-memory architectures is affected by the structure of the interconnection network linking processors to memory modules and by the efficiency of the memory/cache management system. Cache-Coherent Nonuniform Memory Access (CC-NUMA) and Cache-Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring is an interconnection network that is efficient to implement in hardware. This paper focuses on comparative performance modeling and evaluation of CC-NUMA and COMA on a hierarchical ring shared-memory architecture. We present analytical models of the two memory systems for comparative evaluation. We have also conducted extensive performance measurements of data migration on the KSR-1, a COMA hierarchical ring shared-memory machine. The experimental results support the analytical models, and we present practical observations and comparisons of the two cache-coherent memory systems. Our analytical and experimental results show that a COMA system balances the workload well. However, the overhead of frequent data movement may offset the gains obtained from improved load balance. We believe these performance results can be further generalized to the two memory systems on hierarchical network architectures. Although a CC-NUMA system may not automatically balance the load at the system level, it gives the user the option to manage data locality explicitly for a possible performance improvement.
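The trade-off described above (data attracted to its user versus the cost of moving it) can be illustrated with a toy hop-count model. This is a minimal sketch under stated assumptions, not the authors' analytical model: the two-level topology, the unidirectional rings, node 0 of each local ring acting as the global-ring interface, the "no cache reuse" assumption for the fixed-home case, and the `migration_overhead` parameter are all hypothetical choices made for illustration. `fixed_home_cost` stands in for a CC-NUMA-style block that stays at its home node; `migrate_cost` stands in for a COMA-style block that moves to the requesting processor on first use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierRing:
    """Hypothetical two-level hierarchical ring.

    `n_rings` unidirectional local rings of `ring_size` nodes each are joined
    by one unidirectional global ring. Node 0 of each local ring is assumed
    to double as that ring's interface to the global ring.
    """
    n_rings: int
    ring_size: int

    def hops(self, src: int, dst: int) -> int:
        """Ring hops for a one-way message from node `src` to node `dst`."""
        s_ring, s_pos = divmod(src, self.ring_size)
        d_ring, d_pos = divmod(dst, self.ring_size)
        if s_ring == d_ring:
            # Stay on the local ring, moving in its single direction.
            return (d_pos - s_pos) % self.ring_size
        up = (-s_pos) % self.ring_size             # to the source ring's interface
        across = (d_ring - s_ring) % self.n_rings  # along the global ring
        down = d_pos                               # from the interface to dst
        return up + across + down

    def round_trip(self, a: int, b: int) -> int:
        """Request from `a` to `b` plus the reply from `b` back to `a`."""
        return self.hops(a, b) + self.hops(b, a)


def fixed_home_cost(net: HierRing, proc: int, home: int, accesses: int) -> int:
    """CC-NUMA-like toy: every access goes to the fixed home node.

    Assumes the processor's cache cannot keep the block between accesses
    (e.g., it is repeatedly invalidated or evicted).
    """
    return accesses * net.round_trip(proc, home)


def migrate_cost(net: HierRing, proc: int, home: int, accesses: int,
                 migration_overhead: int) -> int:
    """COMA-like toy: the first access moves the block to `proc`.

    Later accesses are treated as local (0 hops); `migration_overhead` is a
    hypothetical fixed cost for relocating the block.
    """
    if accesses == 0:
        return 0
    return net.round_trip(proc, home) + migration_overhead


if __name__ == "__main__":
    net = HierRing(n_rings=4, ring_size=8)
    for k in (1, 2, 3, 5):
        print(k, fixed_home_cost(net, 9, 26, k), migrate_cost(net, 9, 26, k, 30))
```

With these made-up numbers a cross-ring round trip from node 9 to node 26 costs 20 hops, so moving the block (20 + 30 hops) only pays off once the processor touches it three or more times. A block that is migrated often but reused little loses money, which is the kind of effect the abstract attributes to frequent data movement in COMA.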