This paper proposes a hardware mechanism for reducing the coherency overhead that occurs in scientific computations on distributed shared memory (DSM) systems. A first phase detects regular patterns, called streams, of coherency events (such as requests for exclusive or shared access, or invalidations) in the address space. Once a stream is detected at the loop level, the regularity of data accesses can be exploited within the loop (spatial locality) and also between loops (temporal locality). We present a hardware mechanism capable of detecting and exploiting these regular patterns efficiently. We discuss the expected benefits and the hardware complexity, and examine the limited drawbacks and potential overheads. For a benchmark suite of typical scientific applications, the results are very promising, both in terms of the coherency streams detected and the effectiveness of our optimizations.
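To make the notion of a stream concrete, the following is a minimal software sketch of one way such detection could work. It is not the paper's hardware design: the abstract does not specify the detection rule, so the sketch assumes a stream is a run of same-kind coherency events separated by a constant address stride. The names `StreamEntry`, `StreamDetector`, `observe`, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StreamEntry:
    """Per-event-kind tracking state (hypothetical layout)."""
    last_addr: int
    stride: int = 0
    hits: int = 0  # consecutive events that repeated the current stride


class StreamDetector:
    """Assumed rule: a stream is confirmed once the same non-zero stride
    has repeated `threshold` times for one kind of coherency event."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.table: Dict[str, StreamEntry] = {}

    def observe(self, kind: str, addr: int) -> Optional[int]:
        """Record one event; return the predicted next address of the
        stream if one is confirmed, otherwise None."""
        entry = self.table.get(kind)
        if entry is None:
            self.table[kind] = StreamEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        if stride != 0 and stride == entry.stride:
            entry.hits += 1
        else:
            # Stride changed: start tracking a new candidate stream.
            entry.stride = stride
            entry.hits = 0
        entry.last_addr = addr
        if entry.hits >= self.threshold:
            return addr + entry.stride
        return None


detector = StreamDetector(threshold=2)
# Exclusive requests sweeping an array, one 64-byte block at a time.
predictions = [detector.observe("exclusive", a)
               for a in (0x1000, 0x1040, 0x1080, 0x10C0)]
```

In this sketch the first three events only train the entry, and the fourth confirms the stream and yields a predicted address one stride ahead. A predicted address is the point at which a protocol could act early, for example by requesting the block before the processor asks for it. Keeping one entry per event kind is a simplification chosen for brevity; a real table would need to track several concurrent streams.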