Achieving high sustained performance in an unstructured mesh CFD application
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
This paper describes performance tuning experiences with a three-dimensional unstructured grid Euler flow code from NASA, which we have reimplemented in the PETSc framework and ported to several large-scale machines, including the ASCI Red and Blue Pacific machines, the SGI Origin, the Cray T3E, and Beowulf clusters. The code achieves a respectable level of performance for sparse problems, typical of scientific and engineering codes based on partial differential equations, and scales well up to thousands of processors. Because the gap between CPU speed and memory access rate is widening, the code is analyzed from a memory-centric perspective (in contrast to the traditional flop orientation) to understand its sequential and parallel performance. Performance tuning is approached on three fronts: data layouts to enhance locality of reference, algorithmic parameters, and the parallel programming model. This effort was guided partly by some simple performance models developed for the sparse matrix-vector product operation.
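To illustrate what a memory-centric performance model for the sparse matrix-vector product might look like, here is a minimal sketch in Python. It is not the model from the paper. It assumes a compressed sparse row (CSR) layout with 8-byte values and 4-byte indices, assumes each vector entry moves through memory exactly once (ideal cache reuse), and treats sustained memory bandwidth as the only limit. The function name and all parameters are hypothetical.

```python
def spmv_bandwidth_bound(n_rows, nnz, bandwidth_bytes_per_s,
                         value_bytes=8, index_bytes=4):
    """Estimate an upper bound on the flop rate of y = A * x in CSR format.

    Assumptions (illustrative only, not taken from the paper):
      - A is square with n_rows rows and nnz stored nonzeros.
      - Every matrix value and column index is read from memory once.
      - x is read once and y is written once (ideal cache reuse).
      - Memory bandwidth, not arithmetic, limits the run time.
    """
    # One multiply and one add per stored nonzero.
    flops = 2 * nnz

    # Matrix traffic: values, column indices, and the row-pointer array.
    matrix_bytes = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

    # Vector traffic: read x once, write y once.
    vector_bytes = 2 * n_rows * value_bytes

    total_bytes = matrix_bytes + vector_bytes
    time_s = total_bytes / bandwidth_bytes_per_s
    return flops / time_s


if __name__ == "__main__":
    # Hypothetical problem size and bandwidth, chosen only for illustration.
    rate = spmv_bandwidth_bound(n_rows=1_000_000, nnz=15_000_000,
                                bandwidth_bytes_per_s=1.0e9)
    print(f"bandwidth-bound estimate: {rate / 1e6:.1f} Mflop/s")
```

Under these assumptions the estimate can never exceed 2 flops per 12 bytes of matrix traffic, however fast the processor is. That is the kind of reasoning that favors tuning data layout over counting flops.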