This paper describes the design and implementation of a scalable runtime system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors.

Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed-memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC by eliminating affinity tests, removes several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message.

The performance evaluation uses three well-known benchmarks (HPC RandomAccess, HPC STREAM, and NAS CG) to obtain scaling and absolute performance numbers on up to 131072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.