Software Distributed Shared Memory (SDSM) systems usually have a large coherence granularity, imposed by the underlying virtual memory page size. To reduce coherence overheads, such as the network traffic needed to preserve coherence and the page misses caused by false sharing, relaxed memory models are widely adopted in SDSM systems. Under these models, when a shared page is modified, invalidation requests to the other copies are deferred until a synchronization point, and the requests are sent only to the processor that acquires the synchronization variable. At a barrier, however, the invalidation requests must be sent to all processors participating in the barrier. As a result, barriers tend to induce heavy network traffic and may cause useless page misses due to false sharing.

In this paper, we propose a method to alleviate the coherence overheads of barrier synchronization in shared-memory parallel programs. It performs static analysis to examine data dependencies between processors across global barriers, and then inserts special primitives into the program to exploit the dependency information at run time. The static analysis identifies code regions where a processor modifies data that will be used by only some of the other processors. At run time, with the help of the inserted primitives, the coherence messages for that data are sent only to those processors. In particular, if the modified data will not be used by any other processor, the primitives ensure that the coherence messages are delivered only to the master processor, when the parallel execution of the program finishes.

We evaluated the performance of this method on a 16-node software DSM system supporting the AURC protocol. Program-driven simulation was performed with five benchmark programs: Jacobi, Red-black SOR, Expl, LU, and Water-nsquared. For these applications, the experimental results show that our method reduces coherence messages by up to about 98% and improves execution time by up to about 26%.
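
To make the idea of restricting barrier-time coherence messages concrete, the following is a minimal Python sketch of a toy message-count model. It is not the paper's analysis, primitives, or the AURC protocol: the function names (`neighbor_exchange`, `broadcast_messages`, `targeted_messages`), the two-boundary-pages-per-processor layout, and the one-message-per-page-per-recipient cost are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

def neighbor_exchange(num_procs: int) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Hypothetical Jacobi-like pattern: processor p writes two boundary pages,
    and each boundary page is read only by the adjacent processor (if any)."""
    writes: Dict[int, Set[int]] = {}
    readers: Dict[int, Set[int]] = {}
    for p in range(num_procs):
        top, bottom = 2 * p, 2 * p + 1
        writes[p] = {top, bottom}
        if p > 0:
            readers[top] = {p - 1}
        if p < num_procs - 1:
            readers[bottom] = {p + 1}
    return writes, readers

def broadcast_messages(writes: Dict[int, Set[int]], num_procs: int) -> int:
    """Baseline (assumed cost model): at a barrier, every modified page is
    announced to all other participating processors."""
    return sum(len(pages) * (num_procs - 1) for pages in writes.values())

def targeted_messages(writes: Dict[int, Set[int]],
                      readers: Dict[int, Set[int]],
                      master: int = 0) -> int:
    """Targeted (assumed cost model): each modified page is announced only to
    the processors that will read it; a page with no other reader costs one
    message to the master at program end, unless the master wrote it."""
    total = 0
    for writer, pages in writes.items():
        for page in pages:
            consumers = readers.get(page, set()) - {writer}
            if consumers:
                total += len(consumers)
            elif writer != master:
                total += 1  # deferred delivery to the master processor
    return total

if __name__ == "__main__":
    for n in (4, 16):
        w, r = neighbor_exchange(n)
        print(n, broadcast_messages(w, n), targeted_messages(w, r))
```

Under these assumptions, the 4-processor case counts 24 messages for the broadcast baseline and 7 for the targeted scheme. These figures come from the toy model only and are not the paper's measurements.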