An evaluation of global address space languages: Co-Array Fortran and Unified Parallel C. Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
Performance evaluation of supercomputers using HPCC and IMB Benchmarks. Journal of Computer and System Sciences.
Toward enhancing OpenMP's work-sharing directives. Euro-Par'06: Proceedings of the 12th International Conference on Parallel Processing.
In the future, most high-performance computing (HPC) systems will have a hierarchical hardware design, e.g., a cluster of ccNUMA or shared-memory nodes, each with several multi-core CPUs. Parallel programming must therefore combine distributed-memory parallelization across the node interconnect with shared-memory parallelization inside each node. There are many mismatches between this hybrid hardware topology and the hybrid or homogeneous parallel programming models used on such hardware. Hybrid programming that combines MPI and OpenMP is often slower than pure MPI programming. The major opportunities arise from OpenMP's load-balancing features and from a smaller memory footprint when the application replicates some data on every MPI process [1,2,3].