Inexpensive implementations of set-associativity
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Anatomy of a message in the Alewife multiprocessor
ICS '93 Proceedings of the 7th international conference on Supercomputing
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Remote queues: exposing message queues for optimization and atomicity
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Coherent network interfaces for fine-grain communication
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Smart Memories: a modular reconfigurable architecture
Proceedings of the 27th annual international symposium on Computer architecture
Reducing set-associative cache energy via way-prediction and selective direct-mapping
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Scratchpad memory: design alternative for cache on-chip memory in embedded systems
Proceedings of the tenth international symposium on Hardware/software codesign
Telegraphos: High-Performance Networking for Parallel Processing on Workstation Clusters
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Programmable Stream Processors
Computer
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
CellSs: a programming model for the cell BE architecture
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
On-chip communication and synchronization mechanisms with cache-integrated network interfaces
Proceedings of the 7th ACM international conference on Computing frontiers
CoRAM: an in-fabric memory architecture for FPGA-based computing
Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
NP-SARC: Scalable network processing in the SARC multi-core FPGA platform
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.00 |
We report on the hardware implementation of a local memory system for individual processors inside future chip multiprocessors (CMP). It intends to support both implicit communication, via caches, and explicit communication, via directly accessible local ("scratchpad") memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks near each processor, so that part of them operates as 2nd level (local) cache, while the rest operates as scratchpad. We also strive to merge the communication subsystems required by the cache and scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor communicates with the NI in user-level, through virtualized command areas in scratchpad; through a similar mechanism, the NI also provides efficient support for synchronization, using two hardware primitives: counters, and queues. We describe the block diagram, the hardware cost, and the latencies of our FPGA-based prototype implementation, which integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller on a Xilinx-5 FPGA. One-way, end-to-end, user-level communication completes within about 30 clock cycles for short transfer sizes.