Exploiting processor groups to extend scalability of the GA shared memory programming model

Authors:
Jarek Nieplocha;Manoj Krishnan;Bruce Palmer;Vinod Tipparaju;Yeliang Zhang
Affiliations:
Pacific Northwest National Laboratory;Pacific Northwest National Laboratory;Pacific Northwest National Laboratory;Pacific Northwest National Laboratory;University of Arizona
Venue:
Proceedings of the 2nd conference on Computing frontiers
Year:
2005

Citing 16
Cited 3

Advanced programming in the UNIX environment
Performance of the NAS parallel benchmarks on PVM-based networks

Journal of Parallel and Distributed Computing
Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems

OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Global arrays: a nonuniform memory access programming model for high-performance computers

The Journal of Supercomputing
A new model for integrated nested task and data parallel programming

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Approaches for Integrating Task and Data Parallelism

IEEE Concurrency
Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-Time Systems

Proceedings of the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing
Library support for hierarchical multi-processor tasks

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
UPC performance and potential: a NPB experimental study

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Performance comparison of MPI and three openMP programming styles on shared memory multiprocessors

Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and architectures
Dynamically Controlling False Sharing in Distributed Shared Memory

HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
Multilevel Parallelization Models: Application to VIV

DOD_UGC '03 Proceedings of the 2003 DoD User Group Conference
Parallel, multigrain iterative solvers for hiding network latencies on MPPs and networks of clusters

Parallel Computing - Parallel matrix algorithms and applications (PMAA '02)
Processor-Group Aware Runtime Support for Shared- and Global-Address Space Models

ICPPW '04 Proceedings of the 2004 International Conference on Parallel Processing Workshops
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit

International Journal of High Performance Computing Applications

Multilevel Parallelism in Computational Chemistry using Common Component Architecture and Global Arrays

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit

International Journal of High Performance Computing Applications
ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

IEEE Transactions on Parallel and Distributed Systems

Abstract

Exploiting processor groups is becoming increasingly important for programming next-generation high-end systems composed of tens or hundreds of thousands of processors. This paper discusses the requirements, functionality, and development of multilevel parallelism based on processor groups in the context of the Global Arrays (GA) shared memory programming model. The main effort involves management of shared data rather than interprocessor communication. Experimental results for the NAS NPB Conjugate Gradient (CG) benchmark and a molecular dynamics (MD) application, obtained on a Linux cluster with Myrinet, illustrate the value of the proposed approach for improving scalability. While the original GA version of the CG benchmark lagged behind MPI, the processor-group version outperforms MPI in all cases except for a few points on the smallest problem size. Similarly, processor groups were very effective in improving the scalability of the MD application.