The conjugate gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. For ill-conditioned systems, a preconditioner is often required. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0)-preconditioned CG (PCG) across several programming paradigms and architectures. Results show that, for this class of applications: ordering significantly improves overall performance on both distributed-memory and distributed shared-memory systems; cache reuse may be more important than reducing communication; message-passing performance can be matched using shared-memory constructs through careful data ordering and distribution; and a hybrid MPI + OpenMP paradigm increases programming complexity with little performance gain. A multithreaded implementation of CG on the Cray MTA achieves high efficiency and scalability without special ordering or partitioning, a distinct advantage for adaptive applications; its scalability for PCG, however, is limited by a lack of thread-level parallelism.
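For readers unfamiliar with the method, the basic unpreconditioned CG iteration studied here can be sketched as follows. This is a minimal, serial, pure-Python illustration on a small dense system; the paper's solvers operate on large sparse matrices with MPI/OpenMP parallelism, and the function and variable names are ours, not the paper's.

```python
# Minimal sketch of the conjugate gradient (CG) iteration for a
# symmetric positive-definite system A x = b. Dense, serial, and
# for illustration only.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # residual r = b - A*x for x = 0
    p = r[:]                 # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)          # optimal step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:              # converged
            break
        beta = rs_new / rs_old               # direction update weight
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Example: a small SPD system; the exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

The dominant cost per iteration is the sparse matrix-vector product (`matvec` above), which is exactly where the ordering and partitioning strategies examined in the paper determine cache reuse and communication volume.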