The global name space supported by PGAS languages facilitates the expression of parallel algorithms, since communication is implicit. This is especially convenient when writing irregular applications with data-dependent, dynamically changing communication patterns. However, programming in a shared-memory style, with no explicit control of communication, may result in poor performance. The problem may be due to weaknesses of current implementations of PGAS languages or to limitations inherent in the languages themselves. To clarify which is the case, we discuss a UPC implementation of the Barnes-Hut algorithm. A literal port of a good-quality shared-memory implementation (merely replacing shared arrays with partitioned global arrays) achieves abysmal performance, more than 1000 times worse than that of a message-passing implementation. With a series of optimizations, we achieve performance in UPC comparable to that of message passing. Most of these optimizations could be performed with limited changes to the source code, given an enhanced run-time and a few language extensions or pragmas. We discuss the implications for the programmer, the compiler, and PGAS languages themselves.