The industry-wide shift to multi-core architectures has aroused great interest in parallelizing sequential applications. However, fine-grained applications are very difficult to parallelize on multi-core architectures because hardware support for fast communication and synchronization is insufficient. Fortunately, network applications can be decomposed into pipelined structures that are amenable to streaming-based parallel processing. Realizing the potential of pipelining on multi-core architectures requires reevaluating the basic tradeoffs in parallel processing, including those between load balance and data locality and between general lock mechanisms and specialized lock-free data structures. This paper presents our experience building a high-performance multi-core network processing platform in which connection-affinity and lock-free design principles are applied to improve data locality and to speed up core-to-core synchronization and communication. We parallelize a complete Layer 2 to Layer 7 (L2-L7) network processing system on an Intel Core 2 Quad processor, including a TCP/IP stack based on Libnids (L2-L4) and a port-independent protocol identification engine that uses deep packet inspection (L7+). Furthermore, we develop a compilation method that transforms sequential network applications into parallel ones so that they can run on multi-core architectures. Our experience suggests that (1) fine-grained pipelining can be a good software solution for parallelizing network applications on multi-core architectures if connection affinity and lock-free design are adopted as first principles; (2) a careful partitioning scheme is required to map pipelined structures onto a specific multi-core architecture; and (3) an automatic parallelization approach can work if domain knowledge is incorporated into the parallelization process.
Our multi-core network processing platform delivers 6 Gbps for large packet sizes and, more challengingly, 2 Gbps for small packets.