Experimental comparison of memory management policies for NUMA multiprocessors
ACM Transactions on Computer Systems (TOCS)
Implementation and performance of Munin
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor
Computer
Working sets, cache sizes, and node granularity issues for large-scale multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The KSR1: experimentation and modeling of poststore
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
IBM Power and PowerPC
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Hardware and software support for efficient exception handling
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Operating system support for improving data locality on CC-NUMA compute servers
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
A Case for NOW (Networks of Workstations)
IEEE Micro
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors
Proceedings of the 25th annual international symposium on Computer architecture
Excel-NUMA: Toward Programmability, Simplicity, and High Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Memory sharing predictor: the key to a speculative coherent DSM
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Eliminating synchronization overhead in automatically parallelized programs using dynamic feedback
ACM Transactions on Computer Systems (TOCS)
Performance experiences on Sun's Wildfire prototype
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Selective, accurate, and timely self-invalidation using last-touch prediction
Proceedings of the 27th annual international symposium on Computer architecture
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Exploiting Network Locality for CC-NUMA Multiprocessors
The Journal of Supercomputing
MemorIES3: a programmable, real-time hardware emulation tool for multiprocessor server design
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Cache-Only Memory Architectures
Computer
Performance of High-Accuracy PDE Solvers on a Self-Optimizing NUMA Architecture
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
The Thread-Based Protocol Engines for CC-NUMA Multiprocessors
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Proceedings of the 30th annual international symposium on Computer architecture
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Memory coherence activity prediction in commercial workloads
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Proceedings of the 32nd annual international symposium on Computer Architecture
Store-Ordered Streaming of Shared Memory
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
ASR: Adaptive Selective Replication for CMP Caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
A transparent runtime data distribution engine for OpenMP
Scientific Programming
HP scalable computing architecture
WIESS'00 Proceedings of the 1st conference on Industrial Experiences with Systems Software - Volume 1
The case for simple, visible cache coherency
Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
A comparative evaluation of hybrid distributed shared-memory systems
Journal of Systems Architecture: the EUROMICRO Journal
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
This paper proposes and evaluates a new approach to directory-based cache coherence protocols called Reactive NUMA (R-NUMA). An R-NUMA system combines a conventional CC-NUMA coherence protocol with a more-recent Simple-COMA (S-COMA) protocol. What makes R-NUMA novel is the way it dynamically reacts to program and system behavior to switch between CC-NUMA and S-COMA and exploit the best aspects of both protocols. This reactive behavior allows each node in an R-NUMA system to independently choose the best protocol for a particular page, thus providing much greater performance stability than either CC-NUMA or S-COMA alone. Our evaluation is both qualitative and quantitative. We first show the theoretical result that R-NUMA's worst-case performance is bounded within a small constant factor (i.e., two to three times) of the best of CC-NUMA and S-COMA. We then use detailed execution-driven simulation to show that, in practice, R-NUMA usually performs better than either a pure CC-NUMA or pure S-COMA protocol, and no more than 57% worse than the best of CC-NUMA and S-COMA, for our benchmarks and base system assumptions.