Architectural considerations for a new generation of protocols
SIGCOMM '90 Proceedings of the ACM symposium on Communications architectures & protocols
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An effective programmable prefetch engine for on-chip caches
Proceedings of the 28th annual international symposium on Microarchitecture
The design and implementation of the 4.4BSD operating system
The design and implementation of the 4.4BSD operating system
Effects of buffering semantics on I/O performance
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
IEEE Micro
Impulse: Building a Smarter Memory Controller
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
An Efficient Zero-Copy I/O Framework for UNIX
An Efficient Zero-Copy I/O Framework for UNIX
Architectural Characterization of TCP/IP Packet Processing on the Pentium® M Microprocessor
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
TCP offload is a dumb idea whose time has come
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
sNICh: efficient last hop networking in the data center
Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
Analyzing performance and power efficiency of network processing over 10 GbE
Journal of Parallel and Distributed Computing
Hi-index | 14.98 |
Data movement (memory copies) is a very common operation during network processing and application execution on servers. The performance of this operation is rather poor on today's microprocessors due to the following aspects: 1) Several long-latency memory accesses are involved because the source and/or the destination are typically in memory, 2) latency hiding techniques, such as out-of-order execution, hardware threading, and prefetching, are not very effective for bulk data movement, and 3) microprocessors move data at register (small) granularity. In this paper, we show this overhead of bulk data movement and propose the use of dedicated copy engines to minimize it. We present a detailed analysis of copy engine architectures along two dimensions: 1) on-die versus off-die and 2) synchronous versus asynchronous. These copy engine architectures are superior to traditional Direct Memory Access (DMA) engines because they are tightly coupled to the core architecture and enable lower overhead communication and signaling. We describe the hardware support required to implement these copy engines and integrate them into server platforms. We perform a detailed case study to evaluate the performance of these copy engines. The evaluation is based on an execution-driven simulator, which was extended with detailed models of copy engines. Our simulation results show that copy engines are effective in reducing the bulk data movement overhead and, hence, hold significant promise for high-performance server platforms.