Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The Stanford Dash Multiprocessor
Computer
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
The KSR1: experimentation and modeling of poststore
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Execution-driven tools for parallel simulation of parallel architectures and applications
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Memory latency reduction via data prefetching and data forwarding in shared memory multiprocessors
Memory latency reduction via data prefetching and data forwarding in shared memory multiprocessors
Integrating fine-grained message passing in cache coherent shared memory multiprocessors
Journal of Parallel and Distributed Computing
Weak ordering—a new definition
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Dependence Analysis for Supercomputing
Dependence Analysis for Supercomputing
Performance Tradeoffs in Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
Producer-Oriented versus Consumer-Oriented Prefetching: a Comparison and Analysis of Parallel Application Programs
Comparing data forwarding and prefetching for communication-induced misses in shared-memory MPs
ICS '98 Proceedings of the 12th international conference on Supercomputing
Guest Editors' Introduction-Cache Memory and Related Problems: Enhancing and Exploiting the Locality
IEEE Transactions on Computers - Special issue on cache memory and related problems
Hardware spatial forwarding for widely shared data
Proceedings of the 14th international conference on Supercomputing
IWCC '01 Proceedings of the NATO Advanced Research Workshop on Advanced Environments, Tools, and Applications for Cluster Computing-Revised Papers
A Parallel System Architecture Based on Dynamically Configurable Shared Memory Clusters
PPAM '01 Proceedings of the th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
Task Scheduling for Dynamically Configurable Multiple SMP Clusters Based on Extended DSC Approach
PPAM '01 Proceedings of the th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
Cache Injection: A Novel Technique for Tolerating Memory Latency in Bus-Based SMPs
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
CQoS: a framework for enabling QoS in shared caches of CMP platforms
Proceedings of the 18th annual international conference on Supercomputing
Accelerating data movement on future chip multi-processors
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies
Dynamic SMP clusters with communication on the fly
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Dynamic SMP clusters in soc technology – towards massively parallel fine grain numerics
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Parallel matrix multiplication based on dynamic SMP clusters in SoC technology
ISPA'07 Proceedings of the 2007 international conference on Frontiers of High Performance Computing and Networking
Location-aware cache management for many-core processors with deep cache hierarchy
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches.This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.