Comparative performance evaluation of cache-coherent NUMA and COMA architectures

Authors:
Per Stenström; Truman Joe; Anoop Gupta
Venue:
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Year:
1992



Abstract

Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line-sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model that shows that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results using simulation studies for eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.
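
The two factors named in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's model: the class `Latencies`, the functions `numa_miss_cost` and `coma_miss_cost`, the parameters `local_fraction` and `local_hit_rate`, and every cycle count are hypothetical values chosen only to show the direction of the trade-off. It assumes that COMA serves most capacity misses from local memory through migration and replication, that coherence misses are always remote, and that a remote COMA miss costs more than a remote CC-NUMA miss because of the hierarchical directories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latencies:
    """Hypothetical miss latencies in cycles (illustrative, not from the paper)."""
    local_memory: int = 30   # miss served by the node's own memory
    numa_remote: int = 100   # CC-NUMA miss served by a remote node
    coma_remote: int = 130   # COMA remote miss, assumed slower (hierarchical directories)


def numa_miss_cost(capacity_misses, coherence_misses, lat, local_fraction=0.0):
    """Total miss cycles on CC-NUMA.

    local_fraction is the assumed share of capacity misses that land in local
    memory: high for coarse data partitions, low for fine interleaving.
    """
    per_capacity_miss = (local_fraction * lat.local_memory
                         + (1 - local_fraction) * lat.numa_remote)
    return capacity_misses * per_capacity_miss + coherence_misses * lat.numa_remote


def coma_miss_cost(capacity_misses, coherence_misses, lat, local_hit_rate=0.9):
    """Total miss cycles on COMA.

    local_hit_rate is the assumed share of capacity misses served locally
    thanks to automatic migration and replication.
    """
    per_capacity_miss = (local_hit_rate * lat.local_memory
                         + (1 - local_hit_rate) * lat.coma_remote)
    return capacity_misses * per_capacity_miss + coherence_misses * lat.coma_remote


if __name__ == "__main__":
    lat = Latencies()
    scenarios = [
        ("capacity-dominated, fine interleaving", 900, 100, 0.0),
        ("coherence-dominated, fine interleaving", 100, 900, 0.0),
        ("capacity-dominated, coarse partitions", 900, 100, 0.9),
    ]
    for name, cap, coh, local_fraction in scenarios:
        numa = numa_miss_cost(cap, coh, lat, local_fraction)
        coma = coma_miss_cost(cap, coh, lat)
        winner = "COMA" if coma < numa else "CC-NUMA"
        print(f"{name}: CC-NUMA={numa:.0f} COMA={coma:.0f} -> {winner}")
```

Under these assumed numbers, COMA comes out ahead only in the first scenario, where capacity misses dominate and CC-NUMA cannot place the data locally. This matches the qualitative claim in the abstract, but the magnitudes carry no meaning.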