This paper develops and proves an exact distributed invalidation algorithm for programs with general array accesses, arbitrary parallelisation and migratory writes. We present an efficient constructive algorithm that globally combines locally gathered information to insert coherence calls so that invalidation traffic is eliminated without loss of locality, while placing the minimal number of coherence calls. Experimental results across a range of benchmarks show that it outperforms hardware-based sequential consistency and release consistency approaches and decreases application execution time by up to 12%. The improvement comes from eliminating over 99% of the invalidation traffic in every benchmark, which in turn reduces the total amount of network traffic by up to 28% and the number of network words transmitted by up to 19%.
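To make the notion of invalidation traffic concrete, the toy model below counts the invalidation messages a write-invalidate protocol would send for a migratory access pattern, with and without a coherence call after each write. It is only an illustrative sketch, not the paper's algorithm: the function names, the trace format and the `self_invalidate` operation are assumptions introduced here, and the model ignores locality, placement analysis and every other kind of message.

```python
def count_invalidations(trace):
    """Count invalidation messages in a toy write-invalidate protocol.

    trace is a list of (processor, operation, block) tuples, where operation
    is "read", "write" or "self_invalidate" (a hypothetical coherence call).
    """
    sharers = {}  # block -> set of processors currently holding a copy
    invalidations = 0
    for proc, op, block in trace:
        holders = sharers.setdefault(block, set())
        if op == "read":
            holders.add(proc)
        elif op == "write":
            # Every other holder must be sent an invalidation message.
            invalidations += len(holders - {proc})
            sharers[block] = {proc}
        elif op == "self_invalidate":
            # The processor drops its own copy; no invalidation is sent later.
            holders.discard(proc)
        else:
            raise ValueError(f"unknown operation: {op}")
    return invalidations


def migratory_trace(num_procs, block="A", with_calls=False):
    """Each processor in turn reads then writes the same block."""
    trace = []
    for proc in range(num_procs):
        trace.append((proc, "read", block))
        trace.append((proc, "write", block))
        if with_calls:
            trace.append((proc, "self_invalidate", block))
    return trace


if __name__ == "__main__":
    print(count_invalidations(migratory_trace(4)))
    print(count_invalidations(migratory_trace(4, with_calls=True)))
```

In this simplified model, every hand-over of the block between processors costs one invalidation message unless the previous writer has already given up its copy, which is the effect that inserted coherence calls aim for.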