Characterizing the impact of end-system affinities on the end-to-end performance of high-speed flows

Authors:
Nathan Hanford; Vishal Ahuja; Mehmet Balman; Matthew K. Farrens; Dipak Ghosal; Eric Pouyoul; Brian Tierney
Affiliations:
University of California, Davis, CA; University of California, Davis, CA; Lawrence Berkeley Laboratory, Berkeley, CA; University of California, Davis, CA; University of California, Davis, CA; Lawrence Berkeley Laboratory, Berkeley, CA; Lawrence Berkeley Laboratory, Berkeley, CA
Venue:
NDM '13 Proceedings of the Third International Workshop on Network-Aware Data Management
Year:
2013

Abstract

Multi-core end-systems use Receive Side Scaling (RSS) to parallelize protocol processing. RSS applies a hash function to the standard flow descriptors and uses an indirection table to assign incoming packets to receive queues, which are pinned to specific cores. This ensures flow affinity: the interrupt processing for all packets belonging to a specific flow is handled by the same core. A key limitation of standard RSS is that, in determining flow affinity, it does not consider the application process that consumes the incoming data. In this paper, we carry out a detailed experimental analysis of the performance impact of application affinity in a 40 Gbps testbed network with a dual hexa-core end-system. We show, contrary to conventional wisdom, that when the application process and the flow are affinitized to the same core, performance (measured as end-to-end TCP throughput) is significantly lower than the line rate. Near-line-rate performance is observed when the flow and the application process are affinitized to different cores on the same socket. Furthermore, affinitizing the application and the flow to cores on different sockets results in throughput significantly lower than the line rate. These results arise from a memory bottleneck, which is demonstrated using preliminary correlational data on the cache hit rate of the core that services the application process.
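
The RSS mapping described in the abstract (hash of the flow descriptors, then an indirection table lookup, then a queue pinned to a core) can be illustrated with a minimal Python sketch. It is not the authors' code or any NIC driver's implementation. The Toeplitz hash is the function commonly used for RSS, but the key below is an arbitrary placeholder, and the table size (128), the queue count (12), the "queue i is pinned to core i" rule, and the contiguous core numbering (cores 0-5 on socket 0, cores 6-11 on socket 1) are all assumptions chosen to resemble a dual hexa-core host. The `placement` helper is hypothetical and only labels the three affinity scenarios the abstract compares.

```python
import socket
import struct

# Assumption: arbitrary 40-byte key for illustration, not a vendor default.
RSS_KEY = bytes(range(1, 41))
# Assumption: one receive queue per core on a dual hexa-core host.
NUM_QUEUES = 12
CORES_PER_SOCKET = 6
# Assumption: 128-entry indirection table filled round-robin with queue ids.
INDIRECTION_TABLE = [i % NUM_QUEUES for i in range(128)]


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash: XOR a sliding 32-bit key window for every set input bit."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                # Window starts at input bit position (i * 8 + b) of the key.
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Map an IPv4 TCP flow to a receive queue via hash + indirection table."""
    flow = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))
    h = toeplitz_hash(RSS_KEY, flow)
    # The low-order bits of the hash index the indirection table.
    return INDIRECTION_TABLE[h % len(INDIRECTION_TABLE)]


def placement(flow_core: int, app_core: int,
              cores_per_socket: int = CORES_PER_SOCKET) -> str:
    """Hypothetical helper: label the flow/application affinity scenario."""
    if flow_core == app_core:
        return "same core"
    if flow_core // cores_per_socket == app_core // cores_per_socket:
        return "same socket, different core"
    return "different socket"


if __name__ == "__main__":
    # Assumption: queue i is pinned to core i, so the queue id is the flow core.
    flow_core = rss_queue("10.0.0.1", "10.0.0.2", 5001, 40000)
    for app_core in (flow_core, (flow_core + 1) % NUM_QUEUES):
        print(flow_core, app_core, placement(flow_core, app_core))
```

Because every packet of a flow carries the same descriptors, `rss_queue` always returns the same queue for that flow, which is the flow affinity the abstract refers to. Nothing in this mapping looks at where the consuming application runs, which is the limitation the paper examines.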