Sharing data between processors becomes increasingly expensive as the number of cores in a system grows. In particular, the network processing overhead on larger systems can reach tens of thousands of CPU cycles per TCP packet, for just hundreds of "useful" instructions. Most of these cycles are spent stalled on "bouncing" cache lines of network control data shared by all processors in the system, or synchronizing access to this shared state. In many cases, the resulting excessive CPU utilization limits overall system performance.

We describe the IsoStack architecture, which eliminates unnecessary sharing of network control state at all stack layers, from low-level device access, through the transport protocol, to the socket interface layer. IsoStack "offloads" network stack processing to a dedicated processor core; multiple applications running on the remaining cores invoke IsoStack services in parallel, using a thin access layer that emulates the standard sockets API without introducing new dependencies between the processors.

We present a prototype implementation of this architecture and provide a detailed performance analysis. We demonstrate the ability to scale up the number of application threads and scale down the size of messages. In particular, we show an order-of-magnitude performance improvement for short messages, reaching 10Gb/s line speed at 40% CPU utilization even for 64-byte messages, whereas the unmodified system saturates while delivering 11 times less throughput.