Communicating sequential processes
Communicating sequential processes
A new approach to the maximum flow problem
STOC '86 Proceedings of the eighteenth annual ACM symposium on Theory of computing
Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Loop distribution with arbitrary control flow
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
PYRROS: static task scheduling and code generation for message passing multiprocessors
ICS '92 Proceedings of the 6th international conference on Supercomputing
Efficient network flow based min-cut balanced partitioning
ICCAD '94 Proceedings of the 1994 IEEE/ACM international conference on Computer-aided design
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
Optimal mapping of sequences of data parallel tasks
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Communication and memory requirements as the basis for mapping task and data parallel programs
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
A stream compiler for communication-exposed architectures
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Optimal Processor Assignment for a Class of Pipelined Computations
IEEE Transactions on Parallel and Distributed Systems
Program Partition and Logic Program Analysis
IEEE Transactions on Software Engineering
Scheduling Data-Parallel Computations on Heterogeneous and Time-Shared Environments
Euro-Par '98 Proceedings of the 4th International Euro-Par Conference on Parallel Processing
Tamper-resistant whole program partitioning
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Coarse-Grain Pipelining on Multiple FPGA Architectures
FCCM '02 Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
High-performance IPv6 forwarding algorithm for multi-core and multithreaded network processor
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Effective thread management on network processors with compiler analysis
Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language, compilers, and tool support for embedded systems
Support for High-Frequency Streaming in CMPs
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Latency hiding through multithreading on a network processor
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Pipelined Execution of Critical Sections Using Software-Controlled Caching in Network Processors
Proceedings of the International Symposium on Code Generation and Optimization
Optimizing software cache performance of packet processing applications
Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Program mapping onto network processors by recursive bipartitioning and refining
Proceedings of the 44th annual Design Automation Conference
FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Performance scalability of decoupled software pipelining
ACM Transactions on Architecture and Code Optimization (TACO)
SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Design of a scalable network programming framework
Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
DSL '09 Proceedings of the IFIP TC 2 Working Conference on Domain-Specific Languages
A throughput-driven task creation and mapping for network processors
HiPEAC'07 Proceedings of the 2nd international conference on High performance embedded architectures and compilers
LATA: a latency and throughput-aware packet processing system
Proceedings of the 47th Design Automation Conference
The case for hardware transactional memory in software packet processing
Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
Compiler assisted dynamic management of registers for network processors
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Compiler-Supported Thread Management for Multithreaded Network Processors
ACM Transactions on Embedded Computing Systems (TECS)
Task assignment for network processor pipelines using GA
APPT'05 Proceedings of the 6th international conference on Advanced Parallel Processing Technologies
Using machine learning to partition streaming programs
ACM Transactions on Architecture and Code Optimization (TACO)
An automatic thread decomposition approach for pipelined multithreading
International Journal of High Performance Computing and Networking
Accelerating sequential programs on commodity multi-core processors
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Modern network processors employs parallel processing engines (PEs) to keep up with explosive internet packet processing demands. Most network processors further allow processing engines to be organized in a pipelined fashion to enable higher processing throughput and flexibility. In this paper, we present a novel program transformation technique to exploit parallel and pipelined computing power of modern network processors. Our proposed method automatically partitions a sequential packet processing application into coordinated pipelined parallel subtasks which can be naturally mapped to contemporary high-performance network processors. Our transformation technique ensures that packet processing tasks are balanced among pipeline stages and that data transmission between pipeline stages is minimized. We have implemented the proposed transformation method in an auto-partitioning C compiler product for Intel Network Processors. Experimental results show that our method provides impressive speed up for the commonly used NPF IPv4 forwarding and IP forwarding benchmarks. For a 9-stage pipeline, our auto-partitioning C compiler obtained more than 4X speedup for the IPv4 forwarding PPS and the IP forwarding PPS (for both the IPv4 traffic and IPv6 traffic).