Using dataflow analysis techniques to reduce ownership overhead in cache coherence protocols
ACM Transactions on Programming Languages and Systems (TOPLAS)
In this paper we study the design and efficiency of compiler algorithms that remove ownership overhead in shared-memory multiprocessors with write-invalidate protocols. These algorithms detect loads that are followed by stores to the same address. Such loads are marked, and the mark serves as a hint to the cache to obtain an exclusive copy of the block. We consider three algorithms: the first focuses on load-store sequences within each basic block of code, and the other two analyse load-store sequences across basic blocks at the intra-procedural level. Since the dataflow analysis we adopt is a trivial variation of live-variable analysis, the algorithms are easily incorporated into a compiler.

Through detailed simulations of a cache-coherent NUMA architecture using five scientific parallel benchmark programs, we find that the algorithms are capable of removing over 95% of the separate ownership acquisitions. Moreover, we find that even the simplest algorithm is comparable in efficiency with previously proposed hardware-based adaptive cache coherence protocols that attack the same problem.
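The load-marking idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Instr` record, the function names `mark_block` and `mark_function`, and the symbolic addresses are all hypothetical, and exact alias information is assumed. `mark_block` scans one basic block backwards, in the spirit of live-variable analysis, and marks every load that is later followed by a store to the same address. `mark_function` extends this across basic blocks with a simple fixed-point iteration. The abstract does not say how the inter-block algorithms combine information at control-flow merge points, so the sketch assumes a union (a store follows on at least one path).

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str                  # "load" or "store" (hypothetical mini-IR)
    addr: str                # symbolic address; exact aliasing is assumed
    exclusive: bool = False  # hint: fetch the block in exclusive state


def mark_block(block, stores_after_exit=frozenset()):
    """Mark loads in one basic block that are followed by a store.

    Scans backwards, tracking the addresses for which a store is known
    to follow. Returns that set as seen at the block's entry.
    """
    pending = set(stores_after_exit)
    for instr in reversed(block):
        if instr.op == "store":
            pending.add(instr.addr)
        elif instr.op == "load" and instr.addr in pending:
            instr.exclusive = True
    return pending


def mark_function(blocks, succs):
    """Intra-procedural variant over a control-flow graph.

    blocks: dict mapping block name -> list of Instr
    succs:  dict mapping block name -> list of successor block names
    Assumption: sets are merged by union at control-flow joins.
    """
    entry_sets = {name: set() for name in blocks}
    changed = True
    while changed:  # iterate until no entry set grows any further
        changed = False
        for name, block in blocks.items():
            exit_set = set()
            for succ in succs.get(name, []):
                exit_set |= entry_sets[succ]
            new_entry = mark_block(block, exit_set)
            if new_entry != entry_sets[name]:
                entry_sets[name] = new_entry
                changed = True
    return entry_sets


# Usage: the load of "x" is in a different block than the store to "x".
blocks = {
    "A": [Instr("load", "x"), Instr("load", "y")],
    "B": [Instr("store", "x")],
}
mark_function(blocks, {"A": ["B"]})
for name, block in blocks.items():
    for instr in block:
        print(name, instr.op, instr.addr, instr.exclusive)
```

With only `mark_block` applied per block, the load of `x` in block `A` would stay unmarked, because its store lies in block `B`. This is the kind of case the inter-block analysis is meant to catch.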