Automatically partitioning packet processing applications for pipelined architectures

Authors:
Jinquan Dai; Bo Huang; Long Li; Luddy Harrison
Affiliations:
Intel China Software Center, Shanghai, PRC; Intel China Software Center, Shanghai, PRC; Intel China Software Center, Shanghai, PRC; Univ. of Illinois at Urbana-Champaign, Urbana, IL
Venue:
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Year:
2005

Abstract

Modern network processors employ parallel processing engines (PEs) to keep up with explosive Internet packet processing demands. Most network processors further allow processing engines to be organized in a pipelined fashion to enable higher processing throughput and flexibility. In this paper, we present a novel program transformation technique that exploits the parallel and pipelined computing power of modern network processors. Our method automatically partitions a sequential packet processing application into coordinated, pipelined parallel subtasks that map naturally onto contemporary high-performance network processors. The transformation ensures that packet processing tasks are balanced among pipeline stages and that data transmission between pipeline stages is minimized. We have implemented the transformation in an auto-partitioning C compiler product for Intel Network Processors. Experimental results show that our method provides substantial speedup for the commonly used NPF IPv4 forwarding and IP forwarding benchmarks: for a 9-stage pipeline, our auto-partitioning C compiler obtained more than 4X speedup for the IPv4 forwarding PPS and the IP forwarding PPS (for both IPv4 and IPv6 traffic).
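
The abstract names two objectives, balanced pipeline stages and minimal data transmission between stages, without describing the algorithm itself. The following is a minimal sketch of those two objectives only, not the authors' method. It assumes a toy model in which the program is a straight-line chain of operations with known costs; the function name `partition_chain` and the inputs `costs` and `live_out` are hypothetical and chosen for illustration.

```python
from functools import lru_cache


def partition_chain(costs, live_out, stages):
    """Split a chain of operations into `stages` contiguous pipeline stages.

    costs[i]    : assumed instruction cost of operation i
    live_out[i] : assumed number of values live after operation i, i.e. what
                  a cut placed right after operation i would have to transmit
    Returns (bottleneck_cost, transmitted, cut_positions). The heaviest stage
    is minimised first, the total transmitted data second.
    """
    n = len(costs)
    if not 1 <= stages <= n:
        raise ValueError("need 1 <= stages <= number of operations")

    # prefix[i] is the total cost of operations 0..i-1.
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(start, k):
        # Best split of operations start..n-1 into k non-empty stages.
        if k == 1:
            return (prefix[n] - prefix[start], 0, ())
        result = None
        # First stage covers start..end-1; keep at least k-1 operations after it.
        for end in range(start + 1, n - k + 2):
            head = prefix[end] - prefix[start]
            tail_max, tail_comm, tail_cuts = best(end, k - 1)
            cand = (max(head, tail_max),
                    live_out[end - 1] + tail_comm,
                    (end,) + tail_cuts)
            # Tuples compare lexicographically: bottleneck, then transmission.
            if result is None or cand < result:
                result = cand
        return result

    return best(0, stages)


if __name__ == "__main__":
    bottleneck, transmitted, cuts = partition_chain(
        costs=[4, 2, 3, 5, 1, 3], live_out=[2, 1, 3, 1, 2, 0], stages=3)
    print(bottleneck, transmitted, cuts)
```

In this toy model the pipeline's throughput is limited by its heaviest stage, which is why the bottleneck cost is the primary objective and the transmitted data only breaks ties. A real program has branches, loops, and dependences that a linear chain does not capture.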