Overcoming the memory wall in packet processing: hammers or ladders?

Authors:
Jayaram Mudigonda;Harrick M. Vin;Raj Yavatkar
Affiliations:
University of Texas at Austin;University of Texas at Austin;Intel Corporation
Venue:
Proceedings of the 2005 ACM symposium on Architecture for networking and communications systems
Year:
2005

Abstract

The overhead of memory accesses limits the performance of packet processing applications. To overcome this bottleneck, today's network processors can utilize a wide range of mechanisms, such as multi-level memory hierarchies, wide-word accesses, special-purpose result caches, asynchronous memory, and hardware multithreading. However, supporting all of these mechanisms complicates programmability and hardware design, and wastes system resources. In this paper, we address the following fundamental question: what minimal set of hardware mechanisms must a network processor support to achieve the twin goals of simplified programmability and high packet throughput? We show that no single mechanism suffices; the minimal set must include data caches and multithreading. The two are complementary: data caches exploit locality to reduce the number of context switches and the off-chip memory bandwidth requirement, whereas multithreading exploits parallelism to hide long cache-miss latencies.
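
The complementarity argument can be illustrated with a back-of-the-envelope model. The sketch below is not the paper's evaluation method. It assumes a textbook saturation model in which context switches are free, every cache miss stalls its thread for a fixed latency, and packets are independent. All function names, parameters, and numbers are hypothetical.

```python
import math


def stall_cycles(mem_accesses, miss_rate, miss_latency):
    """Cycles one packet spends waiting on off-chip memory (assumed model)."""
    return mem_accesses * miss_rate * miss_latency


def utilization(threads, compute_cycles, mem_accesses, miss_rate, miss_latency):
    """Fraction of cycles the core does useful work.

    Assumes zero-cost context switches: while one thread waits on a miss,
    another runs, until the core is fully busy (utilization capped at 1.0).
    """
    stall = stall_cycles(mem_accesses, miss_rate, miss_latency)
    return min(1.0, threads * compute_cycles / (compute_cycles + stall))


def threads_to_saturate(compute_cycles, mem_accesses, miss_rate, miss_latency):
    """Smallest thread count that keeps the core fully busy under this model."""
    stall = stall_cycles(mem_accesses, miss_rate, miss_latency)
    return math.ceil((compute_cycles + stall) / compute_cycles)


if __name__ == "__main__":
    # Hypothetical workload: 100 compute cycles and 10 memory accesses per
    # packet, with a 100-cycle off-chip access latency.
    for miss_rate in (1.0, 0.1):  # 1.0 models a core with no data cache
        n = threads_to_saturate(100, 10, miss_rate, 100)
        u1 = utilization(1, 100, 10, miss_rate, 100)
        print(f"miss rate {miss_rate}: 1-thread utilization {u1:.2f}, "
              f"threads to saturate {n}")
```

Under these assumed numbers, a core with no cache needs 11 threads to stay busy, while a cache with a 10% miss rate needs only 2 and issues one tenth as many off-chip accesses per packet. Neither mechanism alone removes the stall: a single thread with the cache is still idle half the time, and multithreading without the cache leaves the off-chip bandwidth demand unchanged.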