With multicore processors as the standard building block for high performance systems, parallel runtime systems need to deliver excellent performance on shared memory, distributed memory, and hybrids of the two. Conventional wisdom holds that threads should be the runtime mechanism within shared memory, so runtime versions for shared and distributed memory are often designed and implemented separately and then retrofitted for hybrid systems after the fact. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and the two can coexist with one another. We evaluate both shared memory runtime approaches, showing that they interact with the communication layer in important and sometimes surprising ways. Using a set of microbenchmarks and application-level benchmarks on an IBM BG/P, a Cray XT, and an InfiniBand cluster, we show that threads, processes, and combinations of both are needed for maximum performance. Compared to the previous implementation, our new runtime shows speedups of over 60% on application benchmarks and over 100% on collective communication benchmarks. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.
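As background for the process-based approach, the sketch below shows the basic OS mechanism such a runtime can build on: a named POSIX shared memory segment (shm_open/ftruncate/mmap) whose mapping is inherited across fork(), so two separate address spaces share the same physical pages. This is a minimal illustrative sketch, not the paper's GASNet runtime; the segment name, size, and synchronization via waitpid are assumptions made for the example.

/* Minimal sketch: cross-process shared memory via POSIX shm_open/mmap.
 * Illustrates the general mechanism only, not GASNet's implementation.
 * Build (Linux): cc pshm_sketch.c -o pshm_sketch  (older glibc needs -lrt) */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_NAME "/pgas_demo_seg"   /* hypothetical segment name */
#define SEG_SIZE 4096

int main(void) {
    /* Create and size the shared segment before forking. */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SEG_SIZE) != 0) { perror("ftruncate"); return 1; }

    /* Map it MAP_SHARED; the mapping is inherited across fork(),
     * so parent and child see the same physical pages. */
    char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: write into the segment, loosely analogous to one
         * PGAS rank storing into a shared partition. */
        strcpy(seg, "hello from the child process");
        _exit(0);
    }
    waitpid(pid, NULL, 0);          /* crude synchronization for the demo */
    printf("parent reads: %s\n", seg);

    munmap(seg, SEG_SIZE);
    shm_unlink(SEG_NAME);           /* remove the named segment */
    return 0;
}

A design point this sketch hints at: processes keep private address spaces by default and share memory only where explicitly mapped, whereas pthreads share everything by default. The paper's evaluation suggests neither model wins universally, which is why threads, processes, and combinations of both are all needed.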