Thread partitioning and scheduling based on cost model
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Small forwarding tables for fast routing lookups
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
How “hard” is thread partitioning and how “bad” is a list scheduling based partitioning algorithm?
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Fast and scalable layer four switching
Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
High-speed policy-based packet forwarding using efficient multi-dimensional range matching
Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
Packet classification on multiple fields
Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
Automatically partitioning threads for multithreaded architectures
Journal of Parallel and Distributed Computing - Special issue on compilation and architectural support for parallel applications
Scalable packet classification
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
A pipelined memory architecture for high throughput network processors
Proceedings of the 30th annual international symposium on Computer architecture
Packet classification using multidimensional cutting
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Tree bitmap: hardware/software IP lookups with incremental updates
ACM SIGCOMM Computer Communication Review
Adaptive Cache Compression for High-Performance Processors
Proceedings of the 31st annual international symposium on Computer architecture
High-performance IPv6 forwarding algorithm for multi-core and multithreaded network processor
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
High-performance packet classification algorithm for many-core and multithreaded network processor
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
On RTP filtering for network traffic reduction
Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia
Practice of parallelizing network applications on multi-core architectures
Proceedings of the 23rd international conference on Supercomputing
Balanced HiCuts: an optimized packet classification algorithm
ICCOMP'09 Proceedings of the WSEAES 13th international conference on Computers
Three different designs for packet classification
WSEAS Transactions on Computers
Hashing round-down prefixes for rapid packet classification
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Hi-index | 0.00 |
Packet classification is an enabling technology to support advanced Internet services. It is still a challenge for a software solution to achieve 10Gbps (line-rate) classification speed. This paper presents a classification algorithm that can be efficiently implemented on a multi-core architecture with or without cache. The algorithm embraces the holistic notion of exploiting application characteristics, considering the capabilities of the CPU and the memory hierarchy, and performing appropriate data partitioning. The classification algorithm adopts two stages: searching on a reduction tree and searching on a list of ranges. This decision is made based on a classification heuristic: the size of the range list is limited after the first stage search. Optimizations are then designed to speed up the two-stage execution. To exploit the speed gap (1) between the CPU and external memory; (2) between internal memory (cache) and external memory, an interpreter is used to trade the CPU idle cycles with demanding memory access requirements. By applying the CISC style of instruction encoding to compress the range expressions, it not only significantly reduces the total memory requirement but also makes effective use of the internal memory (cache) bandwidth. We show that compressing data structures is an effective optimization across the multi-core architectures. We implement this algorithm on both Intel IXP2800 network processor and Core 2 Duo X86 architecture, and experiment with the classification benchmark, ClassBench. By incorporating architecture-awareness in algorithm design and taking into account the memory hierarchy, data partitioning, and latency hiding in algorithm implementation, the resulting algorithm shows a good scalability on Intel IXP2800. By effectively using the cache system, the algorithm also runs faster than the previous fastest RFC on the Core 2 Duo architecture.