Sharing data between processors becomes increasingly expensive as the number of cores in a system grows. In particular, the network processing overhead on larger systems can reach tens of thousands of CPU cycles per TCP packet, for just hundreds of "useful" instructions. Most of these cycles are spent stalled on "bouncing" cache lines of network control data shared by all processors in the system, or synchronizing access to this shared state. In many cases, the resulting excessive CPU utilization limits overall system performance.

We describe the IsoStack architecture, which eliminates unnecessary sharing of network control state at all stack layers, from low-level device access, through the transport protocol, to the socket interface layer. IsoStack "offloads" network stack processing to a dedicated processor core; multiple applications running on the remaining cores invoke IsoStack services in parallel, using a thin access layer that emulates the standard sockets API without introducing new dependencies between the processors.

We present a prototype implementation of this architecture and provide a detailed performance analysis. We demonstrate the ability to scale up the number of application threads and scale down the size of messages. In particular, we show an order-of-magnitude performance improvement for short messages, reaching 10Gb/s line speed at 40% CPU utilization even for 64-byte messages, whereas the unmodified system saturates while delivering 11 times less throughput.