The Distributed Shared Memory (DSM) model is designed to combine the ease of programming of the shared memory paradigm with the high performance of the message passing model, which it achieves by letting programs express locality. Experience, however, has shown that DSM programming languages, such as UPC, may be unable to deliver the expected level of performance. Initial investigations have shown that one of the major reasons is the overhead of translating from the UPC memory model to the virtual address space of the target architecture, which can be very costly. Experimental measurements have shown this overhead increasing execution time by up to three orders of magnitude. Previous work has also shown that some of this overhead can be avoided by hand-tuning, which, however, can significantly reduce UPC's ease of use. In addition, such tuning can only improve the performance of local shared accesses, not remote shared accesses. Therefore, a new technique that resembles Translation Lookaside Buffers (TLBs) is proposed here. This technique, called the Memory Model Translation Buffer (MMTB), has been implemented in the GCC-UPC compiler using two alternative strategies, full-table (FT) and reduced-table (RT). It will be shown that the MMTB strategies can lead to a performance boost of up to 700%, preserving ease of programming while performing comparably to hand-tuned UPC and MPI codes.
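
To make the translation cost concrete, the following is a minimal sketch in plain C of the kind of work a shared access implies, together with a TLB-like table that caches per-thread base addresses. It is an illustration under stated assumptions, not the MMTB implementation: the abstract does not describe the FT or RT data structures, and the names `shared_ptr_t`, `xlat_table_t`, `index_to_shared`, `xlat_set_base`, `xlat_translate`, and `MAX_THREADS` are all hypothetical. The only part taken from UPC itself is the block-cyclic layout rule, under which element `i` of an array with block size `B` has affinity to thread `(i / B) % THREADS`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 4 /* hypothetical thread count, chosen for this sketch only */

/* Hypothetical shared-pointer form: owning thread plus a byte offset
 * into that thread's shared segment. Real UPC runtimes differ. */
typedef struct {
    uint32_t thread;
    size_t   offset;
} shared_ptr_t;

/* Hypothetical lookup table: one cached base address per thread,
 * playing a role loosely analogous to a TLB entry. */
typedef struct {
    char *base[MAX_THREADS];
} xlat_table_t;

/* Map a global element index to (thread, offset) using the UPC
 * block-cyclic layout with block size `block` over `threads` threads. */
static shared_ptr_t index_to_shared(size_t i, size_t block,
                                    size_t threads, size_t elem_size)
{
    size_t blk   = i / block;                            /* global block number */
    size_t local = (blk / threads) * block + i % block;  /* index on the owner  */
    shared_ptr_t p = { (uint32_t)(blk % threads), local * elem_size };
    return p;
}

/* Record the base address of one thread's shared segment. */
static void xlat_set_base(xlat_table_t *t, uint32_t thread, void *base)
{
    assert(thread < MAX_THREADS);
    t->base[thread] = (char *)base;
}

/* Translate with one table lookup and one addition.
 * Returns NULL when no base address is cached for the thread. */
static void *xlat_translate(const xlat_table_t *t, shared_ptr_t p)
{
    if (p.thread >= MAX_THREADS || t->base[p.thread] == NULL)
        return NULL;
    return t->base[p.thread] + p.offset;
}
```

In this sketch, the divisions and modulo operations in `index_to_shared` stand in for the per-access arithmetic that makes naive translation expensive, while `xlat_translate` shows how a cached table can reduce the final step to an indexed load and an add. Whether the FT and RT strategies work this way is not stated in the abstract.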