We compare several parallel implementation approaches for the clustering operations performed during adaptive meshing in patch-based structured adaptive mesh refinement (SAMR) applications. Specifically, we target the clustering algorithm of Berger and Rigoutsos, which is widely used in SAMR applications. The baseline for comparison is a single-program, multiple-data (SPMD) extension of the original algorithm that works well for up to O(10²) processors. Our goal is a clustering algorithm for machines with up to O(10⁵) processors, such as the 64K-processor IBM BlueGene/L (BG/L) system. We first present an algorithm that avoids unneeded communications in the baseline approach, improving clustering speed by up to an order of magnitude. We then present a new task-parallel implementation that further reduces communication wait time, adding another order of magnitude of improvement. The new algorithms exhibit more favorable scaling behavior on our test problems. Performance is evaluated on a number of large-scale parallel computer systems, including a 16K-processor BG/L system.
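The Berger–Rigoutsos algorithm mentioned above clusters flagged cells into rectangular patches by recursively cutting a bounding box at a hole (a zero in a per-axis "signature", i.e. the count of flagged cells along each row or column) or, failing that, at the strongest inflection of the signature's second difference. The following is a minimal serial 2D sketch, not the parallel implementation evaluated in the paper; the parameter names `min_eff` and `min_len` and the helper `_find_cut` are illustrative assumptions.

```python
import numpy as np

def _find_cut(sig):
    """Illustrative cut selection: prefer an interior hole (zero) in the
    signature; otherwise cut at the largest jump of the second difference;
    otherwise fall back to the midpoint."""
    zeros = np.flatnonzero(sig == 0)
    interior = zeros[(zeros > 0) & (zeros < len(sig) - 1)]
    if interior.size:
        return int(interior[0])
    if len(sig) >= 4:
        lap = sig[2:] - 2 * sig[1:-1] + sig[:-2]   # discrete Laplacian
        jumps = np.abs(np.diff(lap))
        if jumps.size and jumps.max() > 0:
            return int(np.argmax(jumps)) + 2       # interior index
    return len(sig) // 2

def berger_rigoutsos(flags, min_eff=0.8, min_len=2):
    """Cluster flagged cells of a 2D boolean array into boxes, returned as
    (i0, i1, j0, j1) half-open index ranges. Recursion stops when a box's
    fill efficiency (flagged cells / box area) reaches min_eff or the box
    becomes too small to split."""
    if not flags.any():
        return []
    rows = np.flatnonzero(flags.any(axis=1))
    cols = np.flatnonzero(flags.any(axis=0))
    i0, i1 = int(rows[0]), int(rows[-1]) + 1       # tight bounding box
    j0, j1 = int(cols[0]), int(cols[-1]) + 1
    sub = flags[i0:i1, j0:j1]
    if sub.sum() / sub.size >= min_eff or min(sub.shape) <= min_len:
        return [(i0, i1, j0, j1)]
    sig_i, sig_j = sub.sum(axis=1), sub.sum(axis=0)
    if sub.shape[0] >= sub.shape[1]:               # split the longer axis
        c = _find_cut(sig_i)
        halves = [(sub[:c, :], (i0, j0)), (sub[c:, :], (i0 + c, j0))]
    else:
        c = _find_cut(sig_j)
        halves = [(sub[:, :c], (i0, j0)), (sub[:, c:], (i0, j0 + c))]
    boxes = []
    for half, (oi, oj) in halves:
        for (a, b, u, v) in berger_rigoutsos(half, min_eff, min_len):
            boxes.append((a + oi, b + oi, u + oj, v + oj))
    return boxes
```

For two well-separated blobs of flagged cells, the interior zeros of the signature let the recursion peel them apart into one tight box each. The SPMD baseline in the paper parallelizes this recursion across processors, which is where the communication costs the new algorithms attack arise.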