Recent I/O technologies such as PCI Express and 10 Gb Ethernet enable unprecedented levels of I/O bandwidth in mainstream platforms. However, in traditional architectures, memory latency alone can prevent processors from keeping pace with 10 Gb/s of inbound network I/O traffic. We propose a platform-wide method called Direct Cache Access (DCA) to deliver inbound I/O data directly into processor caches. We demonstrate that DCA significantly reduces memory latency and memory bandwidth demand for receive-intensive network I/O applications. Analysis of benchmarks such as SPECweb99, TPC-W, and TPC-C shows that the overall benefit depends on the relative volume of I/O traffic to memory traffic as well as on the spatial and temporal relationship between processor and I/O memory accesses. A system-level perspective on the efficient implementation of DCA is presented.