Thread partitioning and scheduling based on cost model
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Small forwarding tables for fast routing lookups
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
How “hard” is thread partitioning and how “bad” is a list scheduling based partitioning algorithm?
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Fast and scalable layer four switching
Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
High-speed policy-based packet forwarding using efficient multi-dimensional range matching
Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
Packet classification on multiple fields
Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
Automatically partitioning threads for multithreaded architectures
Journal of Parallel and Distributed Computing - Special issue on compilation and architectural support for parallel applications
Scalable packet classification
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
A pipelined memory architecture for high throughput network processors
Proceedings of the 30th annual international symposium on Computer architecture
Packet classification using multidimensional cutting
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Programming challenges in network processor deployment
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Tree bitmap: hardware/software IP lookups with incremental updates
ACM SIGCOMM Computer Communication Review
IBM PowerNP network processor: Hardware, software, and applications
IBM Journal of Research and Development
High-performance IPv6 forwarding algorithm for multi-core and multithreaded network processor
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Towards high-performance flow-level packet processing on multi-core network processors
Proceedings of the 3rd ACM/IEEE Symposium on Architecture for networking and communications systems
Scalable packet classification using interpreting: a cross-platform multi-core solution
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems
SIP server performance on multicore systems
IBM Journal of Research and Development
Hi-index | 0.00 |
Packet classification is crucial for the Internet to provide more value-added services and guaranteed quality of service. Besides hardware-based solutions, many software-based classification algorithms have been proposed. However, classifying at 10Gbps speed or higher is a challenging problem and it is still one of the performance bottlenecks in core routers. In general, classification algorithms face the same challenge of balancing between high classification speed and low memory requirements. This paper proposes a modified Recursive Flow Classification (RFC) algorithm, Bitmap-RFC, which significantly reduces the memory requirements of RFC by applying a bitmap compression technique. To speed up classifying speed, we experiment on exploiting the architectural features of a many-core and multithreaded architecture from algorithm design to algorithm implementation. As a result, Bitmap-RFC strikes a good balance between speed and space. It can not only keep high classification speed but also reduce memory space significantly.This paper investigates the main NPU software design aspects that have dramatic performance impacts on any NPU-based implementations: memory space reduction, instruction selection, data allocation, task partitioning, and latency hiding. We experiment with an architecture-aware design principle to guarantee the high performance of the classification algorithm on an NPU implementation. The experimental results show that the Bitmap-RFC algorithm achieves 10Gbps speed or higher and has a good scalability on Intel IXP2800 NP.