The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications. In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access (RA) benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance: with hybrid UPC groups spanning two cluster nodes, RA performance increases by a factor of 1.33, and with groups spanning four cluster nodes, Barnes-Hut achieves a twofold speedup at the expense of a 2% increase in code size.
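The sketch below illustrates the kind of structure the abstract describes: UPC threads within a group cooperate through the shared global address space, while one designated thread per group exchanges group-level results with other groups over MPI. It is a minimal sketch under several assumptions not stated in the abstract: the group leader is UPC thread 0, only leaders call MPI, and MPI_COMM_WORLD is taken to contain one rank per UPC group. The paper's hybrid runtime may organize initialization and communicators differently.

```c
/*
 * Hedged sketch of a hybrid MPI+UPC program: a group-local reduction in
 * UPC followed by an inter-group reduction over MPI performed by the
 * group leader (assumed to be UPC thread 0).  Not the paper's runtime;
 * communicator setup and launch details are assumptions.
 */
#include <upc.h>
#include <mpi.h>
#include <stdio.h>

#define N 1024

shared double data[N];     /* distributed across the UPC threads of this group */
shared double group_sum;   /* group-level accumulator, affinity to thread 0    */

int main(int argc, char **argv)
{
    int i;
    double local_sum = 0.0, global_sum = 0.0;
    upc_lock_t *lock = upc_all_lock_alloc();   /* collective lock allocation */

    /* Each thread fills and sums the elements it has affinity to. */
    upc_forall (i = 0; i < N; i++; &data[i]) {
        data[i] = (double)i;
        local_sum += data[i];
    }
    upc_barrier;

    /* Combine per-thread partial sums inside the UPC group. */
    upc_lock(lock);
    group_sum += local_sum;
    upc_unlock(lock);
    upc_barrier;

    /* Group leader exchanges the group result with the other groups
     * over MPI; MPI_COMM_WORLD is assumed to span one rank per group. */
    if (MYTHREAD == 0) {
        MPI_Init(&argc, &argv);
        double gsum = group_sum;
        MPI_Allreduce(&gsum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        printf("group sum = %f, global sum = %f\n", gsum, global_sum);
        MPI_Finalize();
    }
    upc_barrier;
    return 0;
}
```

In this style, an existing memory-constrained MPI code could keep its MPI structure and incrementally move its largest data structures into UPC shared arrays within a group, while a locality-constrained UPC code could confine fine-grained shared accesses to node-sized groups and use coarser MPI messages between groups.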