Parallel Tree Building on a Range of Shared Address Space Multiprocessors: Algorithms and Application Performance

  • Venue: IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
  • Year: 1998

Abstract

Irregular, particle-based applications that use trees, for example hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coherent shared address space programming model. As more and more supercomputing platforms that can support different programming models become available to users, from tightly-coupled hardware-coherent machines to clusters of workstations or SMPs, to truly deliver on its ease of programming advantages to application users it is important that the shared address space model not only perform and scale well in the tightly-coupled case but also port well in performance across the range of platforms (as the message passing model can). For tree-based N-body applications, this is currently not true: while the actual computation of interactions ports well, the parallel tree building phase can become a severe bottleneck on coherent shared address space platforms, in particular on platforms with less aggressive, commodity-oriented communication architectures (even though it takes less than 3 percent of the time in most sequential executions). We therefore investigate the performance of five parallel tree building methods in the context of a complete galaxy simulation on four very different platforms that support this programming model: an SGI Origin2000 (an aggressive hardware cache-coherent machine with physically distributed memory), an SGI Challenge bus-based shared memory multiprocessor, an Intel Paragon running a shared virtual memory protocol in software at page granularity, and a Wisconsin Typhoon-zero in which the granularity of coherence can be varied using hardware support but the protocol runs in software (in the last case using both a page-based and a fine-grained protocol).
We find that the algorithms used successfully and widely distributed so far for the first two platforms cause overall application performance to be very poor on the latter two commodity-oriented platforms. An alternative algorithm briefly considered earlier for hardware-coherent systems but then ignored in that context helps to some extent but not enough. Nor does an algorithm that incrementally updates the tree every time-step rather than rebuilding it. The best algorithm by far is a new one we propose that uses a separate spatial partitioning of the domain for the tree building phase (different from the partitioning used in the major force calculation and other phases) and eliminates locking at a significant cost in locality and load balance. By changing the tree building algorithm, we achieve improvements in overall application performance of more than factors of 4-40 on commodity-based systems, even on only 16 processors. This allows commodity shared memory platforms to perform well for hierarchical N-body applications for the first time, and more importantly achieves performance portability since it also performs very well on hardware-coherent systems.
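
The abstract gives no implementation details, so the following is only a minimal illustrative sketch of the general idea of lock-free tree building through spatial partitioning, not the authors' algorithm. It assumes a 2D quadtree over the unit square, particles at distinct positions, and one worker per top-level quadrant; the names `Node`, `build_tree`, and `count` are hypothetical. Because each worker inserts only into its own subtree, no locks are needed.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Quadtree cell covering the square [x, x+size) x [y, y+size)."""

    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size
        self.children = None  # list of 4 child Nodes once subdivided
        self.particle = None  # a leaf stores at most one particle

    def quadrant(self, p):
        """Index (0..3) of the child cell that contains point p."""
        half = self.size / 2
        return (1 if p[0] >= self.x + half else 0) + (2 if p[1] >= self.y + half else 0)

    def subdivide(self):
        half = self.size / 2
        self.children = [
            Node(self.x + (i % 2) * half, self.y + (i // 2) * half, half)
            for i in range(4)
        ]

    def insert(self, p):
        """Insert p below this node. Assumes all particle positions are distinct."""
        node = self
        while True:
            if node.children is not None:
                node = node.children[node.quadrant(p)]
            elif node.particle is None:
                node.particle = p
                return
            else:
                # Occupied leaf: split it, push the old particle down, retry p.
                old, node.particle = node.particle, None
                node.subdivide()
                node.children[node.quadrant(old)].insert(old)


def build_tree(particles, workers=4):
    """Build a quadtree with one independent, lock-free task per top-level quadrant."""
    root = Node(0.0, 0.0, 1.0)
    root.subdivide()
    # Spatial partitioning step (sequential here for brevity).
    buckets = [[] for _ in range(4)]
    for p in particles:
        buckets[root.quadrant(p)].append(p)

    def fill(i):
        # Task i touches only bucket i and subtree i, so no locking is required.
        for p in buckets[i]:
            root.children[i].insert(p)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fill, range(4)))
    return root


def count(node):
    """Number of particles stored in the subtree rooted at node."""
    if node.children is None:
        return 0 if node.particle is None else 1
    return sum(count(c) for c in node.children)
```

Calling `build_tree(points)` on a list of `(x, y)` tuples in the unit square returns the root cell. The sketch also hints at the trade-off the abstract mentions: a purely spatial split can give workers very unequal numbers of particles, and the particles a worker inserts need not be the ones it handles in other phases.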