Evaluating network processing efficiency with processor partitioning and asynchronous I/O

Authors:
Tim Brecht;G. (John) Janakiraman;Brian Lynn;Vikram Saletore;Yoshio Turner
Affiliations:
University of Waterloo;Hewlett Packard Laboratories;Hewlett Packard Laboratories;Intel® Corporation;Hewlett Packard Laboratories
Venue:
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Year:
2006

Abstract

Applications requiring high-speed TCP/IP processing can easily saturate a modern server. We and others have previously suggested alleviating this problem in multiprocessor environments by dedicating a subset of the processors to perform network packet processing. The remaining processors perform only application computation, thus eliminating contention between these functions for processor resources. Applications interact with packet processing engines (PPEs) using an asynchronous I/O (AIO) programming interface which bypasses the operating system. A key attraction of this overall approach is that it exploits the architectural trend toward greater thread-level parallelism in future systems based on multi-core processors. In this paper, we conduct a detailed experimental performance analysis comparing this approach to a best-practice configured Linux baseline system.
We have built a prototype system implementing this architecture, ETA+AIO (Embedded Transport Acceleration with Asynchronous I/O), and ported a high-performance web server to the AIO interface. Although the prototype uses modern single-core CPUs instead of future multi-core CPUs, an analysis of its performance can reveal important properties of this approach. Our experiments show that the ETA+AIO prototype has a modest advantage over the baseline Linux system in packet processing efficiency, consuming fewer CPU cycles to sustain the same throughput. This efficiency advantage enables the ETA+AIO prototype to achieve higher peak throughput than the baseline system, but only for workloads where the mix of packet processing and application processing approximately matches the allocation of CPUs in the ETA+AIO system, thereby enabling high utilization of all the CPUs.
Detailed analysis shows that the efficiency advantage of the ETA+AIO prototype, which uses one PPE CPU, comes from avoiding multiprocessing overheads in packet processing, lower overhead of our AIO interface compared to standard sockets, and reduced cache misses due to processor partitioning.
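
The condition on workload mix can be illustrated with a toy capacity model. The sketch below is not taken from the paper: the function names, the two-CPU configuration, and every cycle count are hypothetical values chosen only to show the shape of the argument. It assumes that a symmetric baseline spreads both kinds of work over all CPUs, while a partitioned system is limited by whichever CPU group saturates first.

```python
def baseline_throughput(n_cpus, cycles_per_cpu, pkt_cycles, app_cycles):
    """Symmetric baseline: every CPU does packet and application work."""
    return n_cpus * cycles_per_cpu / (pkt_cycles + app_cycles)


def partitioned_throughput(n_ppe, n_app, cycles_per_cpu, pkt_cycles, app_cycles):
    """Partitioned system: bounded by the slower of the two CPU groups."""
    ppe_limit = n_ppe * cycles_per_cpu / pkt_cycles
    app_limit = n_app * cycles_per_cpu / app_cycles
    return min(ppe_limit, app_limit)


# Hypothetical parameters (assumptions, not measurements from the paper).
CYCLES_PER_CPU = 1_000_000   # cycles available per CPU per unit time
BASELINE_PKT = 100           # packet-processing cycles per request, baseline
PARTITIONED_PKT = 80         # assumed lower cost on a dedicated PPE CPU

for app_cycles in (20, 80, 300):  # light, balanced, and heavy application work
    base = baseline_throughput(2, CYCLES_PER_CPU, BASELINE_PKT, app_cycles)
    part = partitioned_throughput(1, 1, CYCLES_PER_CPU, PARTITIONED_PKT, app_cycles)
    winner = "partitioned" if part > base else "baseline"
    print(f"app_cycles={app_cycles}: baseline={base:.0f} "
          f"partitioned={part:.0f} -> {winner}")
```

Under these assumed numbers, the partitioned configuration comes out ahead only in the balanced case, where the PPE CPU and the application CPU saturate at about the same request rate. When application work is much lighter or much heavier than packet processing, one CPU group sits partly idle and the symmetric baseline can use its cycles more fully, even though it spends more cycles per packet.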