Effective cache prefetching on bus-based multiprocessors
ACM Transactions on Computer Systems (TOCS)
Tolerating latency through software-controlled data prefetching
Tolerating latency through software-controlled data prefetching
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Architectural mechanisms for explicit communication in shared memory multiprocessors
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
A compiler algorithm that reduces read latency in ownership-based cache coherence protocols
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Data Forwarding in Scalable Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Two techniques for improving performance on bus-based multiprocessors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
(R) The Impact of Speeding up Critical Sections with Data Prefetching and Forwarding
ICPP '96 Proceedings of the Proceedings of the 1996 International Conference on Parallel Processing - Volume 3
IWCC '01 Proceedings of the NATO Advanced Research Workshop on Advanced Environments, Tools, and Applications for Cluster Computing-Revised Papers
A Parallel System Architecture Based on Dynamically Configurable Shared Memory Clusters
PPAM '01 Proceedings of the th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
Task Scheduling for Dynamically Configurable Multiple SMP Clusters Based on Extended DSC Approach
PPAM '01 Proceedings of the th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Dynamic SMP clusters with communication on the fly
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Multi-CMP system with data communication on the fly
The Journal of Supercomputing
Dynamic SMP clusters in soc technology – towards massively parallel fine grain numerics
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Scheduling architecture---supported regions in parallel programs
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Data transfers on the fly for hierarchical systems of chip multi-processors
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Scheduling parallel programs based on architecture: supported regions
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part II
Parallel matrix multiplication based on dynamic SMP clusters in SoC technology
ISPA'07 Proceedings of the 2007 international conference on Frontiers of High Performance Computing and Networking
Hi-index | 0.00 |
Cache misses and bus traffic are key obstacles to achieving high performance of bus-based shared memory multiprocessors using invalidation-based snooping caches. To overcome these problems, software-controlled techniques for tolerating memory latency can be used, such as cache prefetching and data forwarding. However, some previous studies have shown that cache prefetching is not so effective in bus-based shared memory multiprocessors, while data forwarding is not easy to implement in this environment. In this paper, we propose a novel technique called cache injection, which combines consumer and producer initiated approaches, as well as the broadcasting nature of bus. Performance evaluation based on program-driven simulation and a set of eight parallel benchmark programs shows that cache injection is highly effective in reducing coherence misses and bus traffic.