Efficient and correct execution of parallel programs that share memory
ACM Transactions on Programming Languages and Systems (TOPLAS)
What are race conditions?: Some issues and formalizations
ACM Letters on Programming Languages and Systems (LOPLAS)
Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
An optimizing Fortran D compiler for MIMD distributed-memory machines
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An HPF compiler for the IBM SP2
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers
ICS '95 Proceedings of the 9th international conference on Supercomputing
Global communication analysis and optimization
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
A Unified Framework for Optimizing Communication in Data-Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
Analyses and optimizations for shared address space programs
Journal of Parallel and Distributed Computing - Special issue on compilation techniques for distributed memory systems
A new algorithm for partial redundancy elimination based on SSA form
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Communication optimizations for parallel C programs
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Advanced compiler design and implementation
Basic compiler algorithms for parallel programs
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
A global communication optimization technique based on data-flow analysis and linear algebra
ACM Transactions on Programming Languages and Systems (TOPLAS)
Message Passing vs. Shared Address Space on a Cluster of SMPs
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Concurrent Static Single Assignment Form and Constant Propagation for Explicitly Parallel Programs
LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Effective Representation of Aliases and Indirect Memory Operations in SSA Form
CC '96 Proceedings of the 6th International Conference on Compiler Construction
UPC performance and potential: a NPB experimental study
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A performance analysis of the Berkeley UPC compiler
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Effective communication coalescing for data-parallel applications
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs
IEEE Transactions on Computers
Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Shared memory programming for large scale machines
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Software behavior oriented parallelization
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Automatic nonblocking communication for partitioned global address space programs
Proceedings of the 21st annual international conference on Supercomputing
Productivity and performance using partitioned global address space languages
Proceedings of the 2007 international workshop on Parallel symbolic computation
Parallel Programmability and the Chapel Language
International Journal of High Performance Computing Applications
Orchestrating data transfer for the cell/B.E. processor
Proceedings of the 22nd annual international conference on Supercomputing
Implementation of parallel programs interpreter in the development environment ParJava
Programming and Computer Software
DBDB: optimizing DMA transfer for the Cell BE architecture
Proceedings of the 23rd international conference on Supercomputing
MPI-aware compiler optimizations for improving communication-computation overlap
Proceedings of the 23rd international conference on Supercomputing
A new ultra-low latency message transfer mechanism
CSN '07 Proceedings of the Sixth IASTED International Conference on Communication Systems and Networks
UTS: an unbalanced tree search benchmark
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Optimizing the use of static buffers for DMA on a CELL chip
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Runtime address space computation for SDSM systems
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
ScaleUPC: a UPC compiler for multi-core systems
Proceedings of the Third Conference on Partitioned Global Address Space Programming Models
An OpenCL framework for heterogeneous multicores with local memory
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A performance model for fine-grain accesses in UPC
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Hybrid PGAS runtime support for multicore nodes
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
Optimizing the Barnes-Hut algorithm in UPC
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Journal of Parallel and Distributed Computing
Automatic communication coalescing for irregular computations in UPC language
CASCON '12 Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research
Design of an application-dependent static-based shared memory network
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Improving communication in PGAS environments: static and dynamic coalescing in UPC
Proceedings of the 27th international ACM conference on International conference on supercomputing
Optimizing remote accesses for offloaded kernels: application to high-level synthesis for FPGA
Proceedings of the Conference on Design, Automation and Test in Europe
Experiences Developing the OpenUH Compiler and Runtime Infrastructure
International Journal of Parallel Programming
Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors
Journal of Signal Processing Systems
Global address space languages like UPC exhibit high performance and portability on a broad class of shared and distributed memory parallel architectures. The most scalable applications use bulk memory copies rather than individual reads and writes to the shared space, but finer-grained sharing can be useful for scenarios such as dynamic load balancing, event signaling, and distributed hash tables. In this paper we present three optimization techniques for global address space programs with fine-grained communication: redundancy elimination, use of split-phase communication, and communication coalescing. Parallel UPC programs are analyzed using static single assignment form and a data flow graph, which are extended to handle the various shared and private pointer types that are available in UPC. The optimizations also take advantage of UPC's relaxed memory consistency model, which reduces the need for cross-thread analysis. We demonstrate the effectiveness of the analysis and optimizations using several benchmarks, which were chosen to reflect the kinds of fine-grained, communication-intensive phases that exist in some larger applications. The optimizations show speedups of up to 70% on three parallel systems, which represent three different types of cluster network technologies.