Automatic multithreading and multiprocessing of C programs for IXP

Authors:
Long Li;Bo Huang;Jinquan Dai;Luddy Harrison
Affiliations:
Intel China Software Center, Shanghai, PRC;Intel China Software Center, Shanghai, PRC;Intel China Software Center, Shanghai, PRC;Univ. of Illinois at Urbana-Champaign, Urbana, IL
Venue:
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2005

Abstract

Effective compilation of packet processing applications onto Intel IXP network processors requires, among other things, the automatic use of multiple threads on one or more processing elements and the automatic introduction of synchronization to enforce dependences between those threads. We describe the program transformation used in the Intel Auto-partitioning C Compiler for IXP to automatically multithread and multiprocess a program for the IXP. The transformation consists of steps that introduce inter-thread signaling to enforce dependences, optimize the placement of that signaling, reduce the number of signals in use to the number available in hardware, and transform the initialization code so that it executes correctly in the multithreaded version. Experimental results show that our method provides impressive speedup for six PPSes (Packet Processing Stages) in the widely used NPF IP forwarding benchmarks. For most PPSes, our algorithms achieve almost linear performance improvement after the automatic multithreading transformation. The automatic multiprocessing transformation further boosts the speedup of two PPSes.
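
The idea of using inter-thread signaling to enforce a dependence can be illustrated with a small sketch. It is not the compiler's output or its algorithm. It assumes a round-robin assignment of packets to threads and uses Python semaphores as a stand-in for IXP hardware signals. The names `run_pipeline`, `worker`, `signals`, and `order` are hypothetical and chosen only for this illustration.

```python
import threading


def run_pipeline(num_threads, num_packets):
    """Process packets on num_threads workers and return the order in
    which the order-sensitive section ran (illustrative sketch only)."""
    # signals[t] is posted when thread t may enter the dependent section.
    # Semaphores stand in for hardware inter-thread signals (assumption).
    signals = [threading.Semaphore(0) for _ in range(num_threads)]
    order = []  # shared state carrying a packet-to-packet dependence

    def worker(tid):
        # Thread tid handles packets tid, tid + num_threads, ...
        for packet in range(tid, num_packets, num_threads):
            payload = packet * packet          # independent work, may overlap
            signals[tid].acquire()             # wait for the previous thread
            order.append((packet, payload))    # dependent section, in order
            signals[(tid + 1) % num_threads].release()  # signal next thread

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for th in threads:
        th.start()
    signals[0].release()  # thread 0 owns the first packet
    for th in threads:
        th.join()
    return order


if __name__ == "__main__":
    result = run_pipeline(4, 10)
    print([packet for packet, _ in result])
```

In this sketch the work before the `acquire` call can overlap across threads, while the dependent section runs in packet order because each thread waits for its predecessor's signal and then signals its successor. Placing the wait as late and the signal as early as possible is the kind of placement choice the abstract refers to, although the sketch makes no attempt to optimize it.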