Scale-out NUMA

Authors:
Stanko Novakovic;Alexandros Daglis;Edouard Bugnion;Babak Falsafi;Boris Grot
Affiliations:
EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland;University of Edinburgh, Edinburgh, United Kingdom
Venue:
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Year:
2014

Citing 40
Cited 0

Memory coherence in shared virtual memory systems

ACM Transactions on Computer Systems (TOCS)
A bridging model for parallel computation

Communications of the ACM
Implementation and performance of Munin

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor

Computer
Parallel programming in Split-C

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Virtual memory mapped network interface for the SHRIMP multicomputer

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Integration of message passing and shared memory in the Stanford FLASH multiprocessor

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Fine-grain access control for distributed shared memory

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Hive: fault containment for shared-memory multiprocessors

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
U-Net: a user-level network interface for parallel and distributed computing

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
TreadMarks: Shared Memory Computing on Networks of Workstations

Computer
Coherent network interfaces for fine-grain communication

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Reactive NUMA: a design for unifying S-COMA and CC-NUMA

Proceedings of the 24th annual international symposium on Computer architecture
Cashmere-2L: software coherent shared memory on a clustered remote-write network

Proceedings of the sixteenth ACM symposium on Operating systems principles
Performance experiences on Sun's Wildfire prototype

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Application-specific protocols for user-level shared memory

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
The Alpha 21364 Network Architecture

HOTI '01 Proceedings of the The Ninth Symposium on High Performance Interconnects
Xen and the art of virtualization

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
An evaluation of global address space languages: co-array fortran and unified parallel C

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
High performance RDMA-based MPI implementation over infiniBand

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
SimFlex: Statistical Sampling of Computer System Simulation

IEEE Micro
Integrated network interfaces for high-bandwidth TCP/IP

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Productivity and performance using partitioned global address space languages

Proceedings of the 2007 international workshop on Parallel symbolic computation
Dynamo: amazon's highly available key-value store

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
The multikernel: a new OS architecture for scalable multicore systems

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
What is Twitter, a social network or a news media?

Proceedings of the 19th international conference on World wide web
Pregel: a system for large-scale graph processing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
vNUMA: a virtual shared-memory multiprocessor

USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
It's time for low latency

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
DRAMSim2: A Cycle Accurate Memory System Simulator

IEEE Computer Architecture Letters
Crunching large graphs with commodity processors

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Fast crash recovery in RAMCloud

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Scale-out processors

Proceedings of the 39th Annual International Symposium on Computer Architecture
Using one-sided RDMA reads to build a fast, CPU-efficient key-value store

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Network interface design for low latency request-response protocols

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference

Quantified Score

Hi-index	0.00

Visualization

Abstract

Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA) -- an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller -- a new architecturally-exposed hardware block integrated into the node's local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.