Design and Evaluation of a Switch Cache Architecture for CC-NUMA Multiprocessors
IEEE Transactions on Computers
Cache-coherent non-uniform memory access (CC-NUMA) multiprocessors continue to suffer from high remote memory access latencies, caused by comparatively slow memory technology and by data transfer delays in the interconnection network. In this paper, we propose a novel hardware caching technique called the switch cache. The main idea is to implement small, fast caches in the crossbar switches of the interconnect to capture and store shared data as they flow from the memory module to the requesting processor. The stored data then serve subsequent requests, reducing the latency of remote memory accesses. The implementation of a cache in a crossbar switch needs to be efficient and robust, yet flexible enough to accommodate changes in the caching protocol. We present the design and implementation details of a CAche Embedded Switch ARchitecture, CAESAR, which uses wormhole routing with virtual channels. Using detailed execution-driven simulations, we find that the CAESAR switch cache improves the performance of CC-NUMA multiprocessors by reducing the number of reads served at distant remote memories by up to 45% and by improving application execution time by as much as 20%. We conclude that switch caches provide a cost-effective solution for designing high-performance CC-NUMA multiprocessors.
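The capture-and-serve idea described in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the CAESAR hardware design: the class and function names, the block-count capacity, the LRU replacement policy, and the `invalidate` hook are all hypothetical choices made for illustration, and a single switch on the path between processor and memory is assumed.

```python
from collections import OrderedDict


class SwitchCache:
    """Toy model of a small cache embedded in one interconnect switch.

    Assumptions (not taken from the paper): capacity is counted in
    blocks, replacement is LRU, and invalidation is a simple removal.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block address -> data, in LRU order
        self.hits = 0
        self.misses = 0

    def on_reply(self, addr, data):
        """Capture a data reply as it flows from memory to the requester."""
        self.blocks[addr] = data
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block

    def on_read_request(self, addr):
        """Return the cached data on a hit, or None so the request travels on."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
            self.hits += 1
            return self.blocks[addr]
        self.misses += 1
        return None

    def invalidate(self, addr):
        """Hypothetical coherence hook: drop a block that is no longer valid."""
        self.blocks.pop(addr, None)


def remote_read(switch, memory, addr):
    """Serve a read at the switch if possible, otherwise at remote memory."""
    data = switch.on_read_request(addr)
    if data is None:
        data = memory[addr]            # the slow remote memory access
        switch.on_reply(addr, data)    # the reply passes back through the switch
    return data


if __name__ == "__main__":
    memory = {0x100: "A", 0x140: "B", 0x180: "C"}
    switch = SwitchCache(capacity=2)
    for addr in (0x100, 0x100, 0x140, 0x180, 0x100):
        remote_read(switch, memory, addr)
    print(switch.hits, switch.misses)  # prints "1 4"
```

In this sketch the second read of `0x100` is served at the switch without reaching memory, while the final read misses again because the two-block cache has evicted it. That illustrates why the capacity and replacement policy of such a small cache matter, though the real design trade-offs are those evaluated in the paper, not the ones assumed here.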