Shared memory programming for large scale machines

Authors:
Christopher Barton;CĆlin Casçaval;George Almási;Yili Zheng;Montse Farreras;Siddhartha Chatterje;José Nelson Amaral
Affiliations:
University of Alberta, Edmonton, Canada;IBM T.J.Watson Research Center, Yorktown Heights, NY;IBM T.J.Watson Research Center, Yorktown Heights, NY;Purdue University, West Lafayette IN;Universitat Politecnica de Catalunya, Barcelona Spain;IBM T.J.Watson Research Center, Yorktown Heights, NY;University of Alberta, Edmonton, Canada
Venue:
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Year:
2006

Citing 18
Cited 19

An HPF compiler for the IBM SP2

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers

ICS '95 Proceedings of the 9th international conference on Supercomputing
Global communication analysis and optimization

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Global arrays: a nonuniform memory access programming model for high-performance computers

The Journal of Supercomputing
Programming with POSIX threads

Programming with POSIX threads
Communication optimizations for parallel C programs

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Co-array Fortran for parallel programming

ACM SIGPLAN Fortran Forum
UPC performance and potential: a NPB experimental study

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A performance analysis of the Berkeley UPC compiler

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Performance and Experience with LAPI -- A New High-Performance Communication Library for the IBM RS/6000 SP

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
GASNet Specification, v1.1

GASNet Specification, v1.1
Evaluating support for global address space languages on the Cray X1

Proceedings of the 18th annual international conference on Supercomputing
Fast Address Translation Techniques for Distributed Shared Memory Compilers

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
An evaluation of global address space languages: co-array fortran and unified parallel C

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Communication Optimizations for Fine-Grained UPC Applications

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
HUNTing the Overlap

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Overview of the Blue Gene/L system architecture

IBM Journal of Research and Development
Design and implementation of message-passing services for the Blue Gene/L supercomputer

IBM Journal of Research and Development

Type inference for locality analysis of distributed data structures

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Performance without pain = productivity: data layout and collective communication in UPC

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Programming with tiles

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
The deep computing messaging framework: generalized scalable message passing on the blue gene/P supercomputer

Proceedings of the 22nd annual international conference on Supercomputing
Static Detection of Place Locality and Elimination of Runtime Checks

APLAS '08 Proceedings of the 6th Asian Symposium on Programming Languages and Systems
Efficient, portable implementation of asynchronous multi-place programs

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A Parallel Numerical Library for UPC

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
A characterization of shared data access patterns in UPC programs

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Cloud-TM: harnessing the cloud with distributed transactional memories

ACM SIGOPS Operating Systems Review
ScaleUPC: a UPC compiler for multi-core systems

Proceedings of the Third Conference on Partitioned Global Address Space Programing Models
Optimizing UPC programs for multi-core systems

Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Unified parallel C for GPU clusters: language extensions and compiler implementation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Unifying UPC and MPI runtimes: experience with MVAPICH

Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
Reflex: using low-power processors in smartphones without knowing them

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
UPCBLAS: a library for parallel matrix computations in Unified Parallel C

Concurrency and Computation: Practice & Experience
Automatic communication coalescing for irregular computations in UPC language

CASCON '12 Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research
Performance evaluation of sparse matrix products in UPC

The Journal of Supercomputing
Improving communication in PGAS environments: static and dynamic coalescing in UPC

Proceedings of the 27th international ACM conference on International conference on supercomputing
X10 and APGAS at Petascale

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors.Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, eliminates several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks --- HPC RandomAccess, HPC STREAM and NAS CG --- to obtain scaling and absolute performance numbers for these benchmarks on up to 131072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle WA, demonstrating that PGAS languages support both productivity and performance.