Optimizing the Barnes-Hut algorithm in UPC

Authors:
Junchao Zhang;Babak Behzad;Marc Snir
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Citing 14
Cited 3

SPLASH: Stanford parallel applications for shared-memory

ACM SIGARCH Computer Architecture News
Parallel hierarchical N-body methods

Parallel hierarchical N-body methods
Parallel hierarchical N-body methods and their implications for multiprocessors

Parallel hierarchical N-body methods and their implications for multiprocessors
A parallel hashed Oct-Tree N-body algorithm

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Global arrays: a nonuniform memory access programming model for high-performance computers

The Journal of Supercomputing
A Comparative Characterization of Communication Patterns in Applications Using MPI and Shared Memory on an IBM SP2

CANPC '98 Proceedings of the Second International Workshop on Network-Based Parallel Computing: Communication, Architecture, and Applications
Parallel Tree Building on a Range of Shared Address Space Multiprocessors: Algorithms and Application Performance

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Benchmark Measurements of Current UPC Platforms

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
Communication Optimizations for Fine-Grained UPC Applications

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
The rise and fall of High Performance Fortran: an historical object lesson

Proceedings of the third ACM SIGPLAN conference on History of programming languages
Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Hybrid parallel programming with MPI and unified parallel C

Proceedings of the 7th ACM international conference on Computing frontiers
Fast PGAS Implementation of Distributed Graph Algorithms

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Shared work list: hacking amorphous data parallelism in UPC

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
NUMA-aware shared-memory collective communication for MPI

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Enabling highly-scalable remote memory access programming with MPI-3 one sided

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

PGAS languages' support of a global name space facilitates the expression of parallel algorithms, since communication is implicit. This is especially convenient when writing irregular applications with data-dependent, dynamically changing communication patterns. However, programming in a shared memory style, with no explicit control of communication, may result in poor performance. The problem may be due to weaknesses of current implementations of PGAS languages or limitations inherent in these languages. To clarify which is the case, we discuss an implementation in UPC of the Barnes-Hut algorithm. A literal port of a good quality shared-memory implementation (merely replacing shared arrays with partitioned global arrays) achieves abysmal performance -- more than 1000 times worse than a message-passing implementation. We achieve in UPC a performance comparable to message-passing with a series of optimizations. Most of these optimizations could be performed with limited changes in the source code using an enhanced run-time and a few language extensions or pragmas. We discuss the implications to the programmer, the compiler and PGAS languages themselves.