Implications of hierarchical N-body methods for multiprocessor architectures

Authors:
Jaswinder Pal Singh; John L. Hennessy; Anoop Gupta
Affiliations:
Stanford Univ., Stanford, CA; Princeton Univ., Princeton, NJ; Stanford Univ., Stanford, CA
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1995

Abstract

To design effective large-scale multiprocessors, designers need to understand the characteristics of the applications that will use the machines. Application characteristics of particular interest include the amount of communication relative to computation, the structure of the communication, and the local cache and memory requirements, as well as how these characteristics scale with larger problems and machines. One important class of applications is based on hierarchical N-body methods, which are used to solve a wide range of scientific and engineering problems efficiently. Important characteristics of these methods include the nonuniform and dynamically changing nature of the domains to which they are applied, and their use of long-range, irregular communication. This article examines the key architectural implications of representative applications that use the two dominant hierarchical N-body methods: the Barnes-Hut Method and the Fast Multipole Method.
We first show that exploiting temporal locality on accesses to communicated data is critical to obtaining good performance on these applications and then argue that coherent caches on shared-address-space machines exploit this locality both automatically and very effectively. Next, we examine the implications of scaling the applications to run on larger machines. We use scaling methods that reflect the concerns of the application scientist and find that they lead to different conclusions about how communication traffic and local cache and memory usage scale than does scaling based only on data set size. In particular, we show that under the most realistic form of scaling, both the communication-to-computation ratio and the working-set size (and hence the ideal cache size per processor) grow slowly as larger problems are run on larger machines.
Finally, we examine the effects of using the two dominant abstractions for interprocessor communication: a shared address space and explicit message passing between private address spaces. We show that the lack of an efficiently supported shared address space will substantially increase the programming complexity and performance overheads for these applications.