Parallel Tree Building on a Range of Shared Address Space Multiprocessors: Algorithms and Application Performance

Authors:
Affiliations:
Venue: IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Year: 1998



Abstract

Irregular, particle-based applications that use trees, such as hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coherent shared address space programming model. As more supercomputing platforms that support different programming models become available, from tightly coupled hardware-coherent machines to clusters of workstations or SMPs, the shared address space model can deliver on its ease-of-programming advantage only if it performs and scales well in the tightly coupled case and also ports well in performance across the range of platforms, as the message passing model can. For tree-based N-body applications, this is currently not true. While the actual computation of interactions ports well, the parallel tree building phase can become a severe bottleneck on coherent shared address space platforms, particularly those with less aggressive, commodity-oriented communication architectures, even though it takes less than 3 percent of the time in most sequential executions. We therefore investigate the performance of five parallel tree building methods in the context of a complete galaxy simulation on four very different platforms that support this programming model: an SGI Origin2000 (an aggressive hardware cache-coherent machine with physically distributed memory), an SGI Challenge bus-based shared memory multiprocessor, an Intel Paragon running a shared virtual memory protocol in software at page granularity, and a Wisconsin Typhoon-zero, in which the granularity of coherence can be varied using hardware support but the protocol runs in software (on this platform we use both a page-based and a fine-grained protocol).
We find that the algorithms used successfully and widely distributed so far for the first two platforms cause overall application performance to be very poor on the latter two, commodity-oriented platforms. An alternative algorithm, briefly considered earlier for hardware-coherent systems but then ignored in that context, helps to some extent but not enough. Nor does an algorithm that incrementally updates the tree every time-step rather than rebuilding it. The best algorithm by far is a new one we propose. It uses a separate spatial partitioning of the domain for the tree building phase, different from the partitioning used in the main force calculation and other phases, and eliminates locking at a significant cost in locality and load balance. By changing the tree building algorithm, we improve overall application performance by factors of 4 to 40 on commodity-based systems, even on only 16 processors. This allows commodity shared memory platforms to perform well for hierarchical N-body applications for the first time and, more importantly, achieves performance portability, since the new algorithm also performs very well on hardware-coherent systems.
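
The idea of building the tree from a separate spatial partitioning, without locks, can be illustrated with a minimal sketch. This is not the authors' implementation, only one plausible reading of the abstract under simplifying assumptions: a 2D quadtree stands in for the tree used in the galaxy simulation, the domain is split once into its four top-level quadrants, body positions are distinct, and Python threads stand in for processors. All names (`Node`, `build_subtree`, `build_tree`, `bodies_in`) are hypothetical. Each worker builds the subtree for its own region of space, so no two workers ever touch the same cell and no locking is needed; the subtrees are attached to the root afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Quadtree cell covering the square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size
        self.children = [None] * 4  # sub-cells, created on demand
        self.body = None            # the single body held by a leaf
        self.count = 0              # number of bodies in this subtree

    def quadrant(self, p):
        """Index (0..3) of the sub-cell that contains point p."""
        half = self.size / 2
        return int(p[0] >= self.x0 + half) + 2 * int(p[1] >= self.y0 + half)

    def child_for(self, p):
        q = self.quadrant(p)
        if self.children[q] is None:
            half = self.size / 2
            self.children[q] = Node(self.x0 + (q % 2) * half,
                                    self.y0 + (q // 2) * half, half)
        return self.children[q]

    def insert(self, p):
        """Sequential insertion; assumes all body positions are distinct."""
        if self.count == 0:
            self.body = p                    # empty cell becomes a leaf
        else:
            if self.body is not None:        # leaf must split first
                old, self.body = self.body, None
                self.child_for(old).insert(old)
            self.child_for(p).insert(p)
        self.count += 1


def build_subtree(cell, bodies):
    """Build one subtree privately; touches no shared state, so no locks."""
    node = Node(*cell)
    for p in bodies:
        node.insert(p)
    return node


def build_tree(bodies, size, workers=4):
    root = Node(0.0, 0.0, size)
    half = size / 2
    # Spatial partitioning used only for tree building: one bucket per quadrant.
    buckets = [[] for _ in range(4)]
    for p in bodies:
        buckets[root.quadrant(p)].append(p)
    cells = [((q % 2) * half, (q // 2) * half, half) for q in range(4)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        subtrees = list(pool.map(build_subtree, cells, buckets))
    # Attach the independently built subtrees to the root.
    for q, sub in enumerate(subtrees):
        if sub.count:
            root.children[q] = sub
    root.count = len(bodies)
    return root


def bodies_in(node):
    """Collect all bodies stored in the leaves below node."""
    if node is None:
        return []
    if node.body is not None:
        return [node.body]
    return [b for c in node.children for b in bodies_in(c)]
```

In this sketch the quadrants can hold very different numbers of bodies, and the partitioning ignores which processor later computes forces on each body, which mirrors the locality and load-balance cost the abstract mentions.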