ReNIC: Architectural extension to SR-IOV I/O virtualization for efficient replication
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
High performance network virtualization with SR-IOV
Journal of Parallel and Distributed Computing
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Rethinking network stack design with memory snapshots
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Characterizing the impact of end-system affinities on the end-to-end performance of high-speed flows
NDM '13 Proceedings of the Third International Workshop on Network-Aware Data Management
Improving server application performance via pure TCP ACK receive optimization
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
High-Performance network traffic processing systems using commodity hardware
DataTraffic Monitoring and Analysis
Hi-index | 0.00 |
Traditional architectural designs are normally focused on CPUs and have been often decoupled from I/O considerations. They are inefficient for high-speed network processing with a bandwidth of 10Gbps and beyond. Long latency I/O interconnects on mainstream servers also substantially complicate the NIC designs. In this paper, we start with fine-grained driver and OS instrumentation to fully understand the network processing overhead over 10GbE on mainstream servers. We obtain several new findings: 1) besides data copy identified by previous works, the driver and buffer release are two unexpected major overheads (up to 54%); 2) the major source of the overheads is memory stalls and data relating to socket buffer (SKB) and page data structures are mainly responsible for the stalls; 3) prevailing platform optimizations like Direct Cache Access (DCA) are insufficient for addressing the network processing bottlenecks. Motivated by the studies, we propose a new server I/O architecture where DMA descriptor management is shifted from NICs to an on-chip network engine (NEngine), and descriptors are extended with information about data incurring memory stalls. NEngine relies on data lookups and preloads data to eliminate the stalls during network processing. Moreover, NEngine implements efficient packet movement inside caches to address the remaining issues in data copy. The new architecture allows DMA engine to have very fast access to descriptors and keeps packets in CPU caches instead of NIC buffers, significantly simplifying NICs. Experimental results demonstrate that the new server I/O architecture improves the network processing efficiency by 47% and web server throughput by 14%, while substantially reducing the NIC hardware complexity.