Efficient fair queueing using deficit round robin
SIGCOMM '95 Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
ACM Transactions on Computer Systems (TOCS)
Approximation Algorithms for Scheduling Problems
Approximation Algorithms for Scheduling Problems
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
On runtime management in multi-core packet processing systems
Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
MultiLayer processing - an execution model for parallel stateful packet processing
Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
LATA: a latency and throughput-aware packet processing system
Proceedings of the 47th Design Automation Conference
SIP server performance on multicore systems
IBM Journal of Research and Development
Hi-index | 0.00 |
Chip multiprocessor designs are the most common types of architectures seen in Network Processors. As the Network Processors are used to implement increasingly complicated applications, task distribution among the cores is becoming an important problem. In this paper, we propose a new task allocation scheme for such architectures. This scheme relies on the inherent modular nature of the networking applications and intelligently distributes modules among different execution cores. Additionally, we selectively replicate modules to parallelize execution of tasks having longer processing time. We have developed a technique that uses the probability distribution of the execution times of different modules in the networking applications. The proposed schemes result in resource utilization of up to 95%, 89%, and 84% on average for the processors with 2, 4, and 8 cores, respectively. The schemes are highly scalable and can improve the throughput by 6.72 times for 8 core processors, aggregated over four representative applications. The combination of selective replication of modules and variation-aware task allocation result in up to 12.5% (9.9% on average) performance improvement as compared to a scheme based on just mean processing time.