A class of compatible cache consistency protocols and their support by the IEEE futurebus
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
An evaluation of directory schemes for cache coherence
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The Wisconsin multicube: a new large-scale cache-coherent multiprocessor
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
The Stanford Dash Multiprocessor
Computer
Active messages: a mechanism for integrated communication and computation
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Low-latency message communication support for the AP1000
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The network architecture of the Connection Machine CM-5 (extended abstract)
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
A tightly-coupled processor-network interface
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Closing the window of vulnerability in multiphase memory transactions
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Anatomy of a message in the Alewife multiprocessor
ICS '93 Proceedings of the 7th international conference on Supercomputing
Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
The SPARC architecture manual (version 9)
The SPARC architecture manual (version 9)
IBM Power and PowerPC
Virtual memory mapped network interface for the SHRIMP multicomputer
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Where is time spent in message-passing and shared-memory programs?
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Efficient support for irregular applications on distributed-memory machines
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Remote queues: exposing message queues for optimization and atomicity
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Boosting the performance of hybrid snooping cache protocols
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Decoupled hardware support for distributed shared memory
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Supporting systolic and memory communication in iWarp
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Application-specific protocols for user-level shared memory
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Cost-Effective Parallel Computing
Computer
Integrating Networks and Memory Hierarchies in a Multicomputer Node Architecture
Proceedings of the 8th International Symposium on Parallel Processing
START-NG: Delivering Seamless Parallel Computing
Euro-Par '95 Proceedings of the First International Euro-Par Conference on Parallel Processing
Tempest: a substrate for portable parallel programs
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
Protected, user-level DMA for the SHRIMP network interface
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
FUGU: Implementing Translation and Protection in a Multiuser, Multimodel Multiprocessor
FUGU: Implementing Translation and Protection in a Multiuser, Multimodel Multiprocessor
pSNOW: a tool to evaluate architectural issues for NOW environments
ICS '97 Proceedings of the 11th international conference on Supercomputing
Efficient synchronization: let them eat QOLB
Proceedings of the 24th annual international symposium on Computer architecture
Using prediction to accelerate coherence protocols
Proceedings of the 25th annual international symposium on Computer architecture
Adapting the Network Interface for High-Performance Computing: The CNI Approach
The Journal of Supercomputing - Special issue: high performance distributed computing
Removing the overhead from software-based shared memory
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Spinach: a liberty-based simulator for programmable network interface architectures
Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Store-Ordered Streaming of Shared Memory
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Network Interface Data Caching
IEEE Transactions on Computers
High-performance ethernet-based communications for future multi-core processors
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
SAMOS'09 Proceedings of the 9th international conference on Systems, architectures, modeling and simulation
A new TCB cache to efficiently manage TCP sessions for web servers
Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
Seekable sockets: a mechanism to reduce copy overheads in TCP-based messaging
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Log-based architectures: using multicore to help software behave correctly
ACM SIGOPS Operating Systems Review
Hardware acceleration of transactional memory on commodity systems
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
NP-SARC: Scalable network processing in the SARC multi-core FPGA platform
Journal of Systems Architecture: the EUROMICRO Journal
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Historically, processor accesses to memory-mapped device registers have been marked uncachable to insure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).This paper begins an exploration of network interfaces (NIs) that use coherence---coherent network interfaces (CNIs)---to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register---derived from cachable control registers [39,40]---is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue.Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve the performance by 17-53% on the memory bus and 30-88% on the I/O bus.