A fast algorithm for particle simulations
Journal of Computational Physics
Analysis of cache invalidation patterns in multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
The effect of sharing on the cache and bus performance of parallel programs
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Memory coherence in shared virtual memory systems
ACM Transactions on Computer Systems (TOCS)
LimitLESS directories: A scalable cache coherence scheme
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Performance evaluation of memory consistency models for shared-memory multiprocessors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A rapid hierarchical radiosity algorithm
Proceedings of the 18th annual conference on Computer graphics and interactive techniques
Journal of Parallel and Distributed Computing
Parallel hierarchical N-body methods
Parallel hierarchical N-body methods
Parallel hierarchical N-body methods and their implications for multiprocessors
Parallel hierarchical N-body methods and their implications for multiprocessors
A parallel hashed Oct-Tree N-body algorithm
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Journal of Parallel and Distributed Computing
Hierarchical algorithms and architectures for parallel scientific computing
ICS '90 Proceedings of the 4th international conference on Supercomputing
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The working set model for program behavior
Communications of the ACM
Tango introduction and tutorial
Tango introduction and tutorial
SPLASH: Stanford parallel applications for shared-memory*
SPLASH: Stanford parallel applications for shared-memory*
The rapid evaluation of potential fields in particle systems
The rapid evaluation of potential fields in particle systems
A parallel adaptive fast multipole method
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
CRL: high-performance all-software distributed shared memory
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Message passing versus distributed shared memory on networks of workstations
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
ICS '97 Proceedings of the 11th international conference on Supercomputing
Performance implications of communication mechanisms in all-software global address space systems
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Predicting the performance of distributed virtual shared-memory applications
IBM Systems Journal
Scaling application performance on a cache-coherent multiprocessor
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Resource Scaling Effects on MPP Performance: The STAP Benchmark Implications
IEEE Transactions on Parallel and Distributed Systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Overlapping multi-processing and graphics hardware acceleration: performance evaluation
PVGS '99 Proceedings of the 1999 IEEE symposium on Parallel visualization and graphics
Experiences with Parallel N-Body Simulation
IEEE Transactions on Parallel and Distributed Systems
Accelerating shared virtual memory via general-purpose network interface support
ACM Transactions on Computer Systems (TOCS)
A comparison of three programming models for adaptive applications on the Origin2000
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
A comparison of three programming models for adaptive applications on the origin2000
Journal of Parallel and Distributed Computing
International Journal of Parallel Programming
Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Journal of Parallel and Distributed Computing
Solving irregularly structured problems based on distributed object model
Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Massively parallel implementation of a fast multipole method for distributed memory machines
Journal of Parallel and Distributed Computing
Irregular computations in Fortran - expression and implementation strategies
Scientific Programming
Data exploration of turbulence simulations using a database cluster
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Fast multipole methods on graphics processors
Journal of Computational Physics
Study of hierarchical n-body methods for network-on-chip architectures
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Memory-access optimization of parallel molecular dynamics simulation via dynamic data reordering
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Hi-index | 0.01 |
To design effective large-scale multiprocessors, designers need to understand the characteristics of the applications that will use the machines. Application characteristics of particular interest include the amount of communication relative to computation, the structure of the communication, and the local cache and memory requirements, as well as how these characteristics scale with larger problems and machines. One important class of applications is based on hierarchical N-body methods, which are used to solve a wide range of scientific and engineering problems efficiently. Important characteristics of these methods include the nonuniform and dynamically changing nature of the domains to which they are applied, and their use of long-range, irregular communication. This article examines the key architectural implications of representative applications that use the two dominant hierarchical N-body methods: the Barnes-Hut Method and the Fast Multipole Method.We first show that exploiting temporal locality on accesses to communicated data is critical to obtaining good performance on these applications and then argue that coherent caches on shared-address-space machines exploit this locality both automatically and very effectively. Next, we examine the implications of scaling the applications to run on larger machines. We use scaling methods that reflect the concerns of the application scientist and find that this leads to different conclusions about how communication traffic and local cache and memory usage scale than scaling based only on data set size. In particular, we show that under the most realistic form of scaling, both the communication-to-computation ratio as well as the working-set size (and hence the ideal cache size per processor) grow slowly as larger problems are run on larger machines. Finally, we examine the effects of using the two dominant abstractions for interprocessor communication: a shared address space and explicit message passing between private address spaces. We show that the lack of an efficiently supported shared address space will substantially increase the programming complexity and performance overheads for these applications.