Introduction to algorithms
High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Performance evaluation of memory consistency models for shared-memory multiprocessors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
The DASH prototype: implementation and performance
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Limitations of cache prefetching on a bus-based multiprocessor
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Tolerating latency through software-controlled data prefetching
Tolerating latency through software-controlled data prefetching
Increasing cache port efficiency for dynamic superscalar microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Compiler-based prefetching for recursive data structures
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The GLOW cache coherence protocol extensions for widely shared data
ICS '96 Proceedings of the 10th international conference on Supercomputing
Tango: a hardware-based data prefetching technique for superscalar processors
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Data Forwarding in Scalable Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Designing high bandwidth on-chip caches
Proceedings of the 24th annual international symposium on Computer architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Performance Characterization of the Pentium® Pro Processor
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Design and Performance of Directory Caches for Scalable Shared Memory Multiprocessors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Switch Cache: A Framework for Improving the Remote Memory Access Latency of CC-NUMA Multiprocessors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Circular Buffered Switch Design with Wormhole Routing and Virtual Channels
ICCD '98 Proceedings of the International Conference on Computer Design
Hi-index | 0.00 |
Applications with widely shared data do not perform well on cc-NUMA multiprocessors due to the hot-spots they create in the system. In this paper we address this problem by enhancing the memory controller with a forwarding mechanism capable of hiding the read latency of widely shared data, while potentially decreasing the memory and network contention. Based on the influx of requests, the memory anticipates the next read references and forwards the data in advance to the processors. To identify the set of processors the data is to be forwarded to we use a heuristic based on the spatial locality of memory blocks. To increase the forwarding effectiveness and minimize the number of messages, we incorporate simple filters combined with a feedback mechanism. We also show that further improvements are possible using a combined software-prefetching/hardware-forwarding approach. Our experimental results obtained with a detailed execution driven simulator with ILP processors show significant improvements in execution time (up to 37%).