With multicore processors as the standard building block for high performance systems, parallel runtime systems need to deliver excellent performance on shared memory, distributed memory, and hybrids of the two. Conventional wisdom holds that threads should be the runtime mechanism within shared memory, so runtime versions for shared and distributed memory are often designed and implemented separately and then retrofitted for hybrid systems after the fact. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and the two can coexist with one another. We evaluate both shared memory runtime approaches, showing that they interact with the communication layer in important and sometimes surprising ways. Using a set of microbenchmarks and application-level benchmarks on an IBM BG/P, a Cray XT, and an InfiniBand cluster, we show that threads, processes, and combinations of both are needed for maximum performance. Compared to the previous implementation, our new runtime shows speedups of over 60% on application benchmarks and over 100% on collective communication benchmarks. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.
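As background for the process-based approach, the sketch below shows the basic OS mechanism such a runtime can build on: a named POSIX shared memory segment (shm_open/ftruncate/mmap) whose mapping is inherited across fork(), so two separate address spaces share the same physical pages. This is a minimal illustrative sketch, not the paper's GASNet runtime; the segment name, size, and synchronization via waitpid are assumptions made for the example.

/* Minimal sketch: cross-process shared memory via POSIX shm_open/mmap.
 * Illustrates the general mechanism only, not GASNet's implementation.
 * Build (Linux): cc pshm_sketch.c -o pshm_sketch  (older glibc needs -lrt) */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_NAME "/pgas_demo_seg"   /* hypothetical segment name */
#define SEG_SIZE 4096

int main(void) {
    /* Create and size the shared segment before forking. */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SEG_SIZE) != 0) { perror("ftruncate"); return 1; }

    /* Map it MAP_SHARED; the mapping is inherited across fork(),
     * so parent and child see the same physical pages. */
    char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: write into the segment, loosely analogous to one
         * PGAS rank storing into a shared partition. */
        strcpy(seg, "hello from the child process");
        _exit(0);
    }
    waitpid(pid, NULL, 0);          /* crude synchronization for the demo */
    printf("parent reads: %s\n", seg);

    munmap(seg, SEG_SIZE);
    shm_unlink(SEG_NAME);           /* remove the named segment */
    return 0;
}

A design point this sketch hints at: processes keep private address spaces by default and share memory only where explicitly mapped, whereas pthreads share everything by default. The paper's evaluation suggests neither model wins universally, which is why threads, processes, and combinations of both are all needed.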