Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit

Authors:
Jarek Nieplocha; Bruce Palmer; Vinod Tipparaju; Manojkumar Krishnan; Harold Trease; Edoardo Aprà
Affiliations:
Computational Sciences and Mathematics Department, Pacific Northwest National Laboratory, Richland, WA 99352; -; -; -; Computational Sciences and Mathematics Department, Pacific Northwest National Laboratory, Richland, WA 99352; William R. Wiley Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352
Venue:
International Journal of High Performance Computing Applications
Year:
2006

Abstract

This paper describes the capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining a global index space and a programming syntax similar to those available when programming on a single processor. The goal of GA is to free programmers from the low-level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, the compatibility of GA with MPI enables programmers to take advantage of existing MPI software and libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the attractiveness of using higher-level abstractions to write parallel code.
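
As an illustration of the global index space idea described in the abstract, the following is a minimal single-process sketch in Python. It is not the GA API: the class `ToyGlobalArray` and its methods `locate`, `put`, and `get` are hypothetical names introduced here, the block distribution is an assumed layout, and the "processes" are simulated with plain lists rather than real communication. It only shows how a caller can address data by global index while the mapping to an owner and a local offset stays hidden.

```python
class ToyGlobalArray:
    """Toy, single-process model of a block-distributed 1-D array.

    Hypothetical illustration only: this is not the GA API, and no
    real communication takes place.
    """

    def __init__(self, size, nprocs):
        self.size = size
        self.nprocs = nprocs
        # Ceiling division so that every element has an owner.
        self.block = -(-size // nprocs)
        # One local buffer per simulated process (assumed block layout).
        self.local = [
            [0.0] * max(0, min(self.block, size - p * self.block))
            for p in range(nprocs)
        ]

    def locate(self, i):
        """Map a global index to (owner, local offset)."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        return i // self.block, i % self.block

    def put(self, lo, values):
        """Write values to global indices lo, lo+1, ..."""
        for k, v in enumerate(values):
            owner, off = self.locate(lo + k)
            self.local[owner][off] = v

    def get(self, lo, hi):
        """Read global indices lo..hi-1, wherever they are stored."""
        out = []
        for i in range(lo, hi):
            owner, off = self.locate(i)
            out.append(self.local[owner][off])
        return out


if __name__ == "__main__":
    ga = ToyGlobalArray(size=10, nprocs=4)
    ga.put(2, [1.0, 2.0, 3.0, 4.0])   # spans two simulated owners
    print(ga.get(0, 6))               # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
    print(ga.locate(9))               # (3, 0)
```

In this sketch the caller never names an owner: `put` and `get` take only global indices, which mirrors the single-processor style of indexing that the abstract describes. A real toolkit would replace the list accesses with one-sided communication.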