The industry-wide shift to multi-core architectures has aroused great interest in parallelizing sequential applications. However, fine-grained applications are very difficult to parallelize on multi-core architectures because hardware support for fast communication and synchronization is insufficient. Fortunately, network applications can be decomposed into pipelined structures that are amenable to streaming-based parallel processing. Realizing the potential of pipelining on multi-core architectures requires reevaluating the basic tradeoffs in parallel processing, including those between load balance and data locality and between general lock mechanisms and specialized lock-free data structures. This paper presents our experience building a high-performance multi-core network processing platform in which connection-affinity and lock-free design principles are applied to improve data locality and to speed up core-to-core synchronization and communication. We parallelize a complete Layer 2 to Layer 7 (L2-L7) network processing system on an Intel Core 2 Quad processor, including a TCP/IP stack based on Libnids (L2-L4) and a port-independent protocol identification engine that uses deep packet inspection (L7+). Furthermore, we develop a compilation method that transforms sequential network applications into parallel ones so that they can run on multi-core architectures. Our experience suggests that (1) fine-grained pipelining can be a good software solution for parallelizing network applications on multi-core architectures if connection affinity and lock-free design are adopted as first principles; (2) a careful partitioning scheme is required to map pipelined structures onto a specific multi-core architecture; and (3) an automatic parallelization approach can work if domain knowledge is incorporated into the parallelization process.
Our multi-core network processing platform delivers 6 Gbps for large packet sizes and, more challengingly, 2 Gbps for small packets.