Memory coherence in shared virtual memory systems
ACM Transactions on Computer Systems (TOCS)
A bridging model for parallel computation
Communications of the ACM
Implementation and performance of Munin
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor
Computer
Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Virtual memory mapped network interface for the SHRIMP multicomputer
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Fine-grain access control for distributed shared memory
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Hive: fault containment for shared-memory multiprocessors
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
U-Net: a user-level network interface for parallel and distributed computing
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Coherent network interfaces for fine-grain communication
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Reactive NUMA: a design for unifying S-COMA and CC-NUMA
Proceedings of the 24th annual international symposium on Computer architecture
Cashmere-2L: software coherent shared memory on a clustered remote-write network
Proceedings of the sixteenth ACM symposium on Operating systems principles
Performance experiences on Sun's Wildfire prototype
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Application-specific protocols for user-level shared memory
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
The Alpha 21364 Network Architecture
HOTI '01 Proceedings of the The Ninth Symposium on High Performance Interconnects
Xen and the art of virtualization
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
An evaluation of global address space languages: co-array fortran and unified parallel C
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
High performance RDMA-based MPI implementation over infiniBand
International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
Integrated network interfaces for high-bandwidth TCP/IP
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Productivity and performance using partitioned global address space languages
Proceedings of the 2007 international workshop on Parallel symbolic computation
Dynamo: amazon's highly available key-value store
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
The multikernel: a new OS architecture for scalable multicore systems
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
What is Twitter, a social network or a news media?
Proceedings of the 19th international conference on World wide web
Pregel: a system for large-scale graph processing
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
vNUMA: a virtual shared-memory multiprocessor
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
DRAMSim2: A Cycle Accurate Memory System Simulator
IEEE Computer Architecture Letters
Crunching large graphs with commodity processors
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Fast crash recovery in RAMCloud
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Proceedings of the 39th Annual International Symposium on Computer Architecture
Using one-sided RDMA reads to build a fast, CPU-efficient key-value store
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Network interface design for low latency request-response protocols
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Hi-index | 0.00 |
Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA) -- an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller -- a new architecturally-exposed hardware block integrated into the node's local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.