Thread partitioning and scheduling based on cost model
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Small forwarding tables for fast routing lookups
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
How “hard” is thread partitioning and how “bad” is a list scheduling based partitioning algorithm?
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Fast and scalable layer four switching
Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
High-speed policy-based packet forwarding using efficient multi-dimensional range matching
Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication
Packet classification on multiple fields
Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
Automatically partitioning threads for multithreaded architectures
Journal of Parallel and Distributed Computing - Special issue on compilation and architectural support for parallel applications
Scalable packet classification
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
A pipelined memory architecture for high throughput network processors
Proceedings of the 30th annual international symposium on Computer architecture
Packet classification using multidimensional cutting
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Programming challenges in network processor deployment
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Tree bitmap: hardware/software IP lookups with incremental updates
ACM SIGCOMM Computer Communication Review
IBM PowerNP network processor: Hardware, software, and applications
IBM Journal of Research and Development
High-performance IPv6 forwarding algorithm for multi-core and multithreaded network processor
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
On RTP filtering for network traffic reduction
Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia
Practice of parallelizing network applications on multi-core architectures
Proceedings of the 23rd international conference on Supercomputing
Hint-based cache design for reducing miss penalty in HBS packet classification algorithm
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Packet classification is crucial for the Internet to provide more value-added services and guaranteed quality of service. Besides hardware-based solutions, many software-based classification algorithms have been proposed. However, classifying at 10 Gbps speed or higher is a challenging problem and it is still one of the performance bottlenecks in core routers. In general, classification algorithms face the same challenge of balancing between high classification speed and low memory requirements. This paper proposes a modified recursive flow classification (RFC) algorithm, Bitmap-RFC, which significantly reduces the memory requirements of RFC by applying a bitmap compression technique. To speed up classifying speed, we exploit the multithreaded architectural features in various algorithm development stages from algorithm design to algorithm implementation. As a result, Bitmap-RFC strikes a good balance between speed and space. It can significantly keep both high classification speed and reduce memory space consumption. This paper investigates the main NPU software design aspects that have dramatic performance impacts on any NPU-based implementations: memory space reduction, instruction selection, data allocation, task partitioning, and latency hiding. We experiment with an architecture-aware design principle to guarantee the high performance of the classification algorithm on an NPU implementation. The experimental results show that the Bitmap-RFC algorithm achieves 10 Gbps speed or higher and has a good scalability on Intel IXP2800 NPU.