Memory access buffering in multiprocessors
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
NUMA policies and their relation to memory architecture
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
APRIL: a processor architecture for multiprocessing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
SPLASH: Stanford parallel applications for shared-memory
SPLASH: Stanford parallel applications for shared-memory
Volume rendering on scalable shared-memory MIMD architectures
VVS '92 Proceedings of the 1992 workshop on Volume visualization
Adaptive cache coherency for detecting migratory shared data
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An adaptive cache coherence protocol optimized for migratory sharing
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Communication in the KSR1 MPP: performance evaluation using synthetic workload experiments
ICS '94 Proceedings of the 8th international conference on Supercomputing
Performance evaluation of hybrid hardware and software distributed shared memory protocols
ICS '94 Proceedings of the 8th international conference on Supercomputing
Request Combining in Multiprocessors with Arbitrary Interconnection Networks
IEEE Transactions on Parallel and Distributed Systems
Evaluating the memory overhead required for COMA architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Scheduling and page migration for multiprocessor compute servers
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
A comprehensive bibliography of distributed shared memory
ACM SIGOPS Operating Systems Review
Comparative Modeling and Evaluation of CC-NUMA and COMA on Hierarchical Ring Architectures
IEEE Transactions on Parallel and Distributed Systems
Implementing global memory management in a workstation cluster
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Memory organization in multi-channel optical networks: NUMA and COMA revisited
ICS '96 Proceedings of the 10th international conference on Supercomputing
ICS '96 Proceedings of the 10th international conference on Supercomputing
Performance analysis on a CC-NUMA prototype
IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
In-memory directories: eliminating the cost of directories in CC-NUMAs
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Comparing data forwarding and prefetching for communication-induced misses in shared-memory MPs
ICS '98 Proceedings of the 12th international conference on Supercomputing
Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors
Proceedings of the 25th annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
25 years of the international symposia on Computer architecture (selected papers)
IEEE Transactions on Computers
IEEE Transactions on Computers - Special issue on cache memory and related problems
Excel-NUMA: Toward Programmability, Simplicity, and High Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures
IEEE Transactions on Computers
Exploiting Network Locality for CC-NUMA Multiprocessors
The Journal of Supercomputing
ADir_pNB: A Cost-Effective Way to Implement Full Map Directory-Based Cache Coherence Protocols
IEEE Transactions on Computers
Load Balancing for Parallel Query Execution on NUMA Multiprocessors
Distributed and Parallel Databases
A Simulation Study of Hardware-Oriented DSM Approaches
IEEE Parallel & Distributed Technology: Systems & Technology
Cache-Only Memory Architectures
Computer
IEEE Transactions on Parallel and Distributed Systems
Hardware Versus Software Implementation of COMA
ICPP '97 Proceedings of the international Conference on Parallel Processing
Implementing the Data Diffusion Machine Using Crossbar Routers
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
The Illinois Aggressive Coma Multiprocessor project (I-ACOMA)
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Bus-based COMA-reducing traffic in shared-bus multiprocessors
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
The Thread-Based Protocol Engines for CC-NUMA Multiprocessors
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Proceedings of the 32nd annual international symposium on Computer Architecture
Moving Address Translation Closer to Memory in Distributed Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
ASR: Adaptive Selective Replication for CMP Caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
HP scalable computing architecture
WIESS'00 Proceedings of the 1st conference on Industrial Experiences with Systems Software - Volume 1
Latency Analyses of CC-NUMA and CC-COMA Rings
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
YAARC: yet another approach to further reducing the rate of conflict misses
The Journal of Supercomputing
A comparative evaluation of hybrid distributed shared-memory systems
Journal of Systems Architecture: the EUROMICRO Journal
The CDAG: a data structure for automatic parallelization for a multithreaded architecture
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Tailoring a self-distributing architecture to a cluster computer environment
EURO-PDP'00 Proceedings of the 8th Euromicro conference on Parallel and distributed processing
Hi-index | 0.02 |
Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model that shows that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results using simulation studies for eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.