Packet processing systems maintain high throughput despite relatively high memory latencies by exploiting the coarse-grained parallelism available between packets. In particular, multiple processors are used to overlap the processing of multiple packets. Packet queuing, the fundamental mechanism enabling packet scheduling, differentiated services, and traffic isolation, requires a read-modify-write operation on a linked-list data structure to enqueue and dequeue packets; this operation is a potential serializing bottleneck. If all packets awaiting service are destined for different queues, these read-modify-write cycles can proceed in parallel. However, if all or many of the incoming packets are destined for the same queue, or for a small number of queues, system throughput is serialized by these sequential external memory operations. For this reason, low-latency SRAMs are used to implement the queue data structures. This reduces the absolute cost of serialization but does not eliminate it; SRAM latency still bounds system throughput.

In this paper we observe that the worst-case scenario for packet queuing coincides with the best-case scenario for caches: that is, when locality exists and the majority of packets are destined for a small number of queues. The main contribution of this work is the queuing cache, which consists of a hardware cache and a closely coupled queuing engine that implements queue operations. The queuing cache improves performance dramatically by moving the bottleneck from external memory onto the packet processor, where clock rates are higher and latencies are lower. We compare the queuing cache to a number of alternatives, specifically, SRAM controllers with: no queuing support, a software-controlled cache plus a queuing engine (like that used on Intel's IXP network processor), and a hardware cache. Relative to these models, we show that a queuing cache improves worst-case throughput by factors of 3.1, 1.5, and 2.1, and the throughput of real-world traffic traces by factors of 2.6, 1.3, and 1.75, respectively. We also show that the queuing cache decreases external memory bandwidth usage, on-chip communication, and the number of queuing instructions executed under best-case, worst-case, and real-world traffic workloads. Based on our VHDL models, we conclude that a queuing cache could be implemented at a low cost relative to the resulting performance and efficiency benefits.
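To make the serialization argument concrete, the sketch below shows the kind of per-queue linked-list enqueue/dequeue the abstract refers to: each operation must read the queue's head/tail descriptor, modify the links, and write the descriptor back, so back-to-back operations on the same queue cannot overlap. This is only an illustrative C model under our own assumptions (the paper's actual queue layout, descriptor format, and hardware engine are not reproduced here); all identifiers are hypothetical.

```c
/* Illustrative sketch, not the paper's implementation: a per-queue packet
 * list whose head/tail descriptor is read, modified, and written back on
 * every enqueue/dequeue. When many packets target the same queue, these
 * read-modify-write cycles on the descriptor serialize. */
#include <stddef.h>
#include <stdio.h>

struct packet {
    struct packet *next;   /* link stored alongside the packet buffer */
    size_t         len;
};

struct queue_desc {        /* descriptor held in (slow) external SRAM */
    struct packet *head;
    struct packet *tail;
};

/* Enqueue: read the descriptor, link the packet at the tail, write back. */
static void enqueue(struct queue_desc *q, struct packet *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;          /* modify the old tail's link */
    else
        q->head = p;                /* queue was empty */
    q->tail = p;                    /* write the descriptor back */
}

/* Dequeue: read the descriptor, unlink the head packet, write back. */
static struct packet *dequeue(struct queue_desc *q)
{
    struct packet *p = q->head;
    if (p) {
        q->head = p->next;
        if (!q->head)
            q->tail = NULL;         /* queue drained */
    }
    return p;
}

int main(void)
{
    struct queue_desc q = { NULL, NULL };
    struct packet a = { NULL, 64 }, b = { NULL, 1500 };

    enqueue(&q, &a);
    enqueue(&q, &b);
    printf("dequeued %zu-byte packet\n", dequeue(&q)->len);
    printf("dequeued %zu-byte packet\n", dequeue(&q)->len);
    return 0;
}
```

In this toy model, operations on different `queue_desc` instances are independent and could proceed in parallel, whereas repeated operations on one descriptor form the dependent chain that the queuing cache is designed to keep on-chip.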