In earlier work, we showed that the one-sided communication model found in PGAS languages (such as UPC) offers significant advantages in communication efficiency by decoupling data transfer from processor synchronization. We explore the use of the PGAS model on IBM BlueGene/P, an architecture that combines low-power, quad-core processors with extreme scalability. We demonstrate that the PGAS model, using a new port of the Berkeley UPC compiler and the GASNet one-sided communication layer, outperforms two-sided (MPI) communication both in microbenchmarks and in a case study of the communication-limited NAS FT benchmark. We scale the benchmark up to 16,384 cores of the BlueGene/P and show that UPC consistently outperforms MPI, by as much as 66% for some processor configurations and by 32% on average. In addition, the results demonstrate the scalability of the PGAS model and of the Berkeley implementation of UPC, the viability of using it on machines with multicore nodes, and the effectiveness of the BG/P communication layer in supporting one-sided communication and PGAS languages.