Shangri-La: achieving high performance from compiled network applications while enabling ease of programming

Authors:
Michael K. Chen;Xiao Feng Li;Ruiqi Lian;Jason H. Lin;Lixia Liu;Tao Liu;Roy Ju
Affiliations:
Intel Corporation, Santa Clara, CA;Intel China Research Center Ltd., Beijing, China;China Academy of Sciences, Beijing, China;Intel China Research Center Ltd., Beijing, China;Intel China Research Center Ltd., Beijing, China;China Academy of Sciences, Beijing, China;Intel Corporation, Santa Clara, CA
Venue:
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Year:
2005

Abstract

Programming network processors is challenging. To sustain high line rates, network processors have extremely tight memory access and instruction budgets. Achieving desired performance has traditionally required hand-coded assembly. Researchers have recently proposed high-level programming languages for packet processing, but the challenges of compiling these languages into code that is competitive with hand-tuned assembly remain unanswered.

This paper describes the Shangri-La compiler, which accepts a packet program written in a C-like high-level language and applies scalar and specialized optimizations to generate a highly optimized binary. Hot code paths identified by profiling are mapped across processing elements to maximize processor utilization. Since our compilation target has no hardware caches, software-controlled caches are generated for frequently accessed application data structures. Packet handling optimizations significantly reduce per-packet memory access and instruction counts. Finally, a custom stack model maps stack frames to the fastest levels of the target processor's heterogeneous memory hierarchy.

Binaries generated by the compiler were evaluated on the Intel IXP2400 network processor with eight packet processing cores and eight threads per core. Our results show the importance of both traditional and specialized optimization techniques for achieving the maximum forwarding rates on three network applications: L3-Switch, MPLS, and Firewall.
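
The abstract does not say how the generated software-controlled caches are organized. As a rough illustration only, the general idea of a software-managed cache placed in front of a slow memory can be sketched as a small direct-mapped table. Everything in this sketch (the `SoftwareCache` class, the `ROUTES` table, the line count) is a hypothetical assumption for illustration and is not taken from the Shangri-La compiler.

```python
class SoftwareCache:
    """Direct-mapped cache managed entirely in software (illustrative only)."""

    def __init__(self, backing_lookup, num_lines=16):
        # backing_lookup stands in for a slow read from off-chip memory.
        self.backing_lookup = backing_lookup
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # key currently held by each line
        self.values = [None] * num_lines  # cached value for each line
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        line = key % self.num_lines       # direct-mapped: one candidate line per key
        if self.tags[line] == key:
            self.hits += 1
            return self.values[line]
        # Miss: fetch from the slow memory and overwrite whatever the line held.
        self.misses += 1
        value = self.backing_lookup(key)
        self.tags[line] = key
        self.values[line] = value
        return value


# Hypothetical forwarding table: destination id -> output port.
ROUTES = {dest: dest % 3 for dest in range(64)}

cache = SoftwareCache(ROUTES.__getitem__, num_lines=8)
ports = [cache.lookup(dest) for dest in [5, 5, 13, 5, 5]]
print(ports, cache.hits, cache.misses)
```

In this run, destinations 5 and 13 map to the same cache line, so the lookup of 13 evicts the entry for 5 and the next lookup of 5 misses again. A real design would size and index the cache to keep such conflicts rare for the hot data structures.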