On-chip communication and synchronization mechanisms with cache-integrated network interfaces

  • Authors:
  • Stamatis G. Kavadias;Manolis G.H. Katevenis;Michail Zampetakis;Dimitrios S. Nikolopoulos

  • Affiliations:
  • Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece;Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece;Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece;Foundation for Research and Technology Hellas (FORTH), Heraklion, Crete, Greece

  • Venue:
  • Proceedings of the 7th ACM international conference on Computing frontiers
  • Year:
  • 2010

Abstract

Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multi-word blocks through RDMA copy. Furthermore, we introduce event responses as a mechanism for software-configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multi-word transfer initiation, memory barriers for explicitly selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
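
The abstract does not describe the programming interface, so the following is only a toy software model of one idea it mentions: a barrier over an explicitly selected group of accesses of arbitrary size. It assumes the barrier can be modelled as a byte counter that fires a notification once all expected bytes have been acknowledged. The class name CompletionCounter, its fields, and the ack method are hypothetical and are not taken from the paper or its hardware.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CompletionCounter:
    """Toy model (hypothetical): tracks completion of a selected group of accesses."""
    expected_bytes: int                 # total size of the selected accesses
    on_complete: Callable[[], None]     # stand-in for the notification sent on completion
    completed_bytes: int = 0
    fired: bool = False

    def ack(self, nbytes: int) -> None:
        """Record that nbytes of the selected accesses have completed."""
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        self.completed_bytes += nbytes
        if self.completed_bytes > self.expected_bytes:
            raise ValueError("more bytes acknowledged than expected")
        # Fire exactly once, when every expected byte has been acknowledged.
        if self.completed_bytes == self.expected_bytes and not self.fired:
            self.fired = True
            self.on_complete()


notifications: List[str] = []
counter = CompletionCounter(expected_bytes=256,
                            on_complete=lambda: notifications.append("done"))

# Three transfers of different sizes complete, in any order.
for chunk in (64, 128, 64):
    counter.ack(chunk)
```

In this sketch the notification fires only after the third acknowledgement, regardless of how the 256 bytes are split across transfers, which is the sense in which the barrier covers accesses of arbitrary size.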