FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability

Authors:
George Kalokerinos; Vassilis Papaefstathiou; George Nikiforos; Stamatis Kavadias; Manolis Katevenis; Dionisios Pnevmatikatos; Xiaojun Yang
Affiliations:
Institute of Computer Science, FORTH, Heraklion, Crete, Greece (all authors)
Venue:
SAMOS'09 Proceedings of the 9th international conference on Systems, architectures, modeling and simulation
Year:
2009

Abstract

We report on the hardware implementation of a local memory system for individual processors inside future chip multiprocessors (CMPs). It is intended to support both implicit communication, via caches, and explicit communication, via directly accessible local ("scratchpad") memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks near each processor, so that part of them operates as a second-level (local) cache, while the rest operates as scratchpad. We also strive to merge the communication subsystems required by the cache and the scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuitry. The processor communicates with the NI at user level, through virtualized command areas in the scratchpad; through a similar mechanism, the NI also provides efficient support for synchronization, using two hardware primitives: counters and queues. We describe the block diagram, the hardware cost, and the latencies of our FPGA-based prototype implementation, which integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller on a Xilinx-5 FPGA. One-way, end-to-end, user-level communication completes within about 30 clock cycles for short transfer sizes.
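
The run-time split of local SRAM between cache and scratchpad can be illustrated with a small software model. This is only a sketch and does not describe the prototype's hardware: the 64 KByte total comes from the abstract, while the 4 KByte configuration granularity and the names `BLOCK_BYTES`, `LocalMemoryConfig`, and `reconfigure` are assumptions made here for illustration.

```python
from dataclasses import dataclass

SRAM_BYTES = 64 * 1024   # local SRAM per processor (from the abstract)
BLOCK_BYTES = 4 * 1024   # ASSUMED configuration granularity (hypothetical)


@dataclass(frozen=True)
class LocalMemoryConfig:
    """Hypothetical model: some SRAM blocks act as cache, the rest as scratchpad."""
    cache_blocks: int

    def __post_init__(self):
        total_blocks = SRAM_BYTES // BLOCK_BYTES
        if not 0 <= self.cache_blocks <= total_blocks:
            raise ValueError(f"cache_blocks must be in 0..{total_blocks}")

    @property
    def cache_bytes(self):
        return self.cache_blocks * BLOCK_BYTES

    @property
    def scratchpad_bytes(self):
        # Whatever is not configured as cache operates as scratchpad.
        return SRAM_BYTES - self.cache_bytes

    def reconfigure(self, cache_blocks):
        """Return a new split, modelling a run-time change of the partition."""
        return LocalMemoryConfig(cache_blocks)


cfg = LocalMemoryConfig(cache_blocks=8)
print(cfg.cache_bytes, cfg.scratchpad_bytes)

cfg = cfg.reconfigure(12)
print(cfg.cache_bytes, cfg.scratchpad_bytes)
```

The model keeps a single invariant: cache and scratchpad sizes always sum to the full local SRAM, so growing one region shrinks the other.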