On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Authors:
Stamatis G. Kavadias;Manolis G.H. Katevenis;Michail Zampetakis;Dimitrios S. Nikolopoulos
Affiliations:
Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece;Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece;Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece;Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece
Venue:
Proceedings of the 7th ACM international conference on Computing frontiers
Year:
2010

Abstract

Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multi-word blocks through RDMA copy. Furthermore, we introduce event responses as a mechanism for software-configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multi-word transfer initiation, memory barriers for explicitly selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype and evaluated on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
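
The idea of a barrier over explicitly selected accesses of arbitrary size can be pictured with a small software model. The sketch below is not the hardware design described in the paper. It only assumes that software arms a counter with the total number of bytes it is waiting for, that each completed transfer reports its size, and that a notification fires once the total has been reached. The names `CompletionCounter`, `acknowledge`, and `demo` are hypothetical and introduced purely for illustration.

```python
class CompletionCounter:
    """Toy software model of a byte-counting completion barrier.

    Hypothetical illustration only: names and behavior are assumptions,
    not the interface of the cache-integrated NI.
    """

    def __init__(self, expected_bytes, on_complete):
        if expected_bytes <= 0:
            raise ValueError("expected_bytes must be positive")
        self.remaining = expected_bytes   # bytes still outstanding
        self.on_complete = on_complete    # notification callback
        self.fired = False

    def acknowledge(self, nbytes):
        """Record that a transfer of nbytes has completed."""
        if self.fired:
            raise RuntimeError("counter already completed")
        if nbytes <= 0 or nbytes > self.remaining:
            raise ValueError("invalid acknowledgement size")
        self.remaining -= nbytes
        if self.remaining == 0:
            # All selected accesses have completed: notify the waiter.
            self.fired = True
            self.on_complete()


def demo():
    """Wait for three transfers of different sizes (96 bytes in total)."""
    events = []
    counter = CompletionCounter(96, lambda: events.append("done"))
    for size in (64, 24, 8):
        counter.acknowledge(size)
        events.append(counter.remaining)
    return events
```

In this model the waiter is notified exactly once, when the last of the three transfers is acknowledged, regardless of how the 96 bytes are split across transfers.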