We present a new adaptive fast multipole algorithm and its parallel implementation. The algorithm is kernel-independent in the sense that the evaluation of pairwise interactions does not rely on any analytic expansions; it requires only kernel evaluations. The new method provides the enabling technology for many important problems in computational science and engineering, including viscous flows, fracture mechanics, and screened Coulombic interactions. Our MPI-based parallel implementation logically separates the computation and communication phases to avoid synchronization in the upward and downward computation passes, and thus allows us to fully overlap computation and communication. We measure isogranular and fixed-size scalability for a variety of kernels on the Pittsburgh Supercomputing Center's TCS-1 AlphaServer on up to 3000 processors. We have solved viscous flow problems with up to 2.1 billion unknowns, achieving 1.6 Tflops/s peak performance and 1.13 Tflops/s sustained performance.
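To illustrate what "kernel-independent" means in practice, the sketch below (not code from the paper; all names are hypothetical) shows a direct O(NM) pairwise evaluation that touches the kernel only through point evaluations K(x, y). Because nothing depends on analytic expansions of a specific kernel, swapping the Laplace kernel for a screened Coulomb (Yukawa) kernel requires changing only the function passed in; a kernel-independent FMM accelerates this same sum to a prescribed tolerance.

```python
import numpy as np

def laplace_kernel(r):
    # Free-space Green's function for the 3-D Laplace equation: 1 / (4*pi*r).
    return 1.0 / (4.0 * np.pi * r)

def screened_coulomb_kernel(r, lam=1.0):
    # Yukawa (screened Coulomb) kernel: exp(-lam*r) / (4*pi*r).
    return np.exp(-lam * r) / (4.0 * np.pi * r)

def direct_sum(targets, sources, charges, kernel):
    """Direct O(N*M) summation using only kernel evaluations.

    targets: (N, 3) array of evaluation points
    sources: (M, 3) array of source locations
    charges: (M,)  array of source strengths
    kernel:  callable mapping distances r > 0 to kernel values
    """
    potentials = np.zeros(len(targets))
    for i, x in enumerate(targets):
        r = np.linalg.norm(sources - x, axis=1)
        mask = r > 0.0  # skip self-interactions when a target coincides with a source
        potentials[i] = np.sum(charges[mask] * kernel(r[mask]))
    return potentials
```

For example, a unit charge at the origin evaluated at unit distance gives 1/(4*pi) with the Laplace kernel, and the same call with `screened_coulomb_kernel` handles the screened Coulombic case mentioned above with no other changes.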