This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient (PCG) algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to a further improvement in scalability on the CM-5 by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than one with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation. For matrices resulting from three-dimensional finite difference grids, the scalability is quite good on a hypercube or the CM-5, but not as good on a 2-D mesh architecture. In the case of unstructured sparse matrices with a constant number of nonzero elements in each row, the parallel formulation of the PCG iteration is unscalable on any message-passing parallel architecture unless some ordering is applied to the sparse matrix. The parallel system can be made scalable either if, after reordering, the nonzero elements of the $N \times N$ matrix can be confined to a band whose width is $O(N^y)$ for any $y < 1$, or if the number of nonzero elements per row increases as $N^x$ for any $x > 0$. Scalability increases as the number of nonzero elements per row is increased and/or the width of the band containing these elements is reduced. For unstructured sparse matrices, the scalability is asymptotically the same for all architectures. Many of these analytical results are experimentally verified on the CM-5 parallel computer.
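To make concrete where these overheads arise within each iteration, the following Python/SciPy sketch (an assumed, minimal implementation for illustration, not the authors' code; the names poisson_2d and pcg_diag are hypothetical) runs PCG with a simple diagonal (Jacobi) preconditioner on a block-tridiagonal matrix arising from a 2-D finite difference grid. The comments mark the sparse matrix-vector product, whose communication is local to neighboring processors under a suitable mapping, and the vector inner products, which require the global reductions that the abstract identifies as the dominant communication overhead on large processor counts.

# Illustrative sketch, not the paper's code: one PCG solve with a diagonal
# (Jacobi) preconditioner, annotated with the communication each step implies.
import numpy as np
from scipy.sparse import diags, identity, kron

def poisson_2d(n):
    # 5-point finite difference Laplacian on an n x n grid (block tridiagonal).
    T = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    I = identity(n)
    return (kron(I, T) + kron(T, I)).tocsr()

def pcg_diag(A, b, tol=1e-8, max_iter=1000):
    # PCG with M = diag(A); returns the approximate solution x.
    x = np.zeros_like(b)
    m_inv = 1.0 / A.diagonal()          # diagonal (Jacobi) preconditioner
    r = b - A @ x
    z = m_inv * r
    p = z.copy()
    rz = r @ z                          # inner product -> global reduction
    for _ in range(max_iter):
        Ap = A @ p                      # sparse mat-vec -> neighbor communication
        alpha = rz / (p @ Ap)           # inner product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        if np.sqrt(r @ r) < tol:
            break
        z = m_inv * r                   # preconditioner apply: purely local
        rz_new = r @ z                  # inner product -> global reduction
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Example: solve on a 32 x 32 grid.
A = poisson_2d(32)
b = np.ones(A.shape[0])
x = pcg_diag(A, b)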