Communication-avoiding Krylov subspace methods

  • Authors:
  • James W. Demmel; Mark Hoemmen

  • Affiliations:
  • University of California, Berkeley; University of California, Berkeley

  • Venue:
  • Communication-avoiding Krylov subspace methods
  • Year:
  • 2010


Abstract

Krylov subspace methods (KSMs) are iterative algorithms for solving large, sparse linear systems and eigenvalue problems. Current KSMs rely on sparse matrix-vector multiply (SpMV) and vector-vector operations (such as dot products and vector sums). All of these operations are communication-bound, and data dependencies between them mean that only a small amount of that communication can be hidden. Many important scientific and engineering computations spend much of their time in Krylov methods, so the performance of many codes could be improved by introducing KSMs that communicate less. Our goal is to take s steps of a KSM for the same communication cost as 1 step, which would be optimal. We call the resulting KSMs "communication-avoiding Krylov methods." This thesis makes the following contributions:

(1) We have fast kernels replacing SpMV that can compute the results of s calls to SpMV for the same communication cost as one call (Section 2.1).

(2) We have fast dense kernels as well, such as Tall Skinny QR (TSQR; Section 2.3) and Block Gram-Schmidt (BGS; Section 2.4), which can do the work of Modified Gram-Schmidt applied to s vectors with a factor of Θ(s²) fewer messages in parallel, and a factor of Θ(s/W) fewer words transferred between levels of the memory hierarchy (where W is the fast memory capacity in words).

(3) We have new communication-avoiding Block Gram-Schmidt algorithms for orthogonalization in more general inner products (Section 2.5).

(4) We have new communication-avoiding versions of the following Krylov subspace methods for solving linear systems: the Generalized Minimum Residual method (GMRES; Section 3.4), both unpreconditioned and preconditioned, and the Method of Conjugate Gradients (CG), both unpreconditioned (Section 5.4) and left-preconditioned (Section 5.5).
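The TSQR kernel can be illustrated with a minimal serial sketch, assuming NumPy and a flat (one-level) reduction; a real implementation uses a parallel or sequential reduction tree over row blocks, and the function name here is ours, not the thesis's:

```python
import numpy as np

def tsqr(A, nblocks=4):
    """Tall Skinny QR sketch: factor each row block locally, then factor
    the stack of small R factors once more. This is the flat-reduction
    variant of the TSQR idea; it touches each row block only once."""
    m, n = A.shape
    blocks = np.array_split(A, nblocks, axis=0)
    # Local QR of each tall block (each block must have at least n rows).
    Qs, Rs = zip(*(np.linalg.qr(b) for b in blocks))
    # One more small QR combines the n-by-n R factors.
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Recover the full Q by applying the pieces of Q2 to the local Q factors.
    Q2_blocks = np.split(Q2, nblocks, axis=0)
    Q = np.vstack([Qs[i] @ Q2_blocks[i] for i in range(nblocks)])
    return Q, R
```

In parallel, only the small R factors travel between processors, which is where the Θ(s²) reduction in message count comes from.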
(5) We have new communication-avoiding versions of the following Krylov subspace methods for solving eigenvalue problems, both standard (Ax = λx, for a nonsingular matrix A) and "generalized" (Ax = λMx, for nonsingular matrices A and M): Arnoldi iteration (Section 3.3), and Lanczos iteration, both for Ax = λx (Section 4.2) and Ax = λMx (Section 4.3).

(6) We propose techniques for developing communication-avoiding versions of nonsymmetric Lanczos iteration (for solving nonsymmetric eigenvalue problems Ax = λx) and the Method of Biconjugate Gradients (BiCG) for solving linear systems.

(7) We can combine more stable numerical formulations that use different bases of Krylov subspaces with our techniques for avoiding communication. For a discussion of different bases, see Chapter 7. To see an example of how the choice of basis affects the formulation of the Krylov method, see Section 3.2.2.

(8) We have faster numerical formulations. For example, in our communication-avoiding version of GMRES, CA-GMRES (see Section 3.4), we can pick the restart length r independently of the s-step basis length s. Experiments in Section 3.5.5 show that this ability improves numerical stability. We show in Section 3.6.3 that it also improves performance in practice, resulting in a 2.23× speedup in the CA-GMRES implementation described below.

(9) We combine all of these numerical and performance techniques in a shared-memory parallel implementation of our communication-avoiding version of GMRES, CA-GMRES. Compared to a similarly highly optimized version of standard GMRES, when both are running in parallel on 8 cores of an Intel Clovertown (see Appendix A), CA-GMRES achieves 4.3× speedups over standard GMRES on standard sparse test matrices (described in Appendix B.5). When both are running in parallel on 8 cores of an Intel Nehalem (see Appendix A), CA-GMRES achieves 4.1× speedups. See Section 3.6 for performance results and Section 3.5 for corresponding numerical experiments.
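The role of the s-step basis in these methods can be made concrete with a small sketch, assuming NumPy. The monomial basis below is what s fused SpMVs compute; the Newton basis is one of the better-conditioned alternatives the thesis discusses (Chapter 7). Function names and the example shifts are ours, chosen for illustration:

```python
import numpy as np

def monomial_basis(A, v, s):
    """The s-step monomial basis [v, Av, A^2 v, ..., A^s v].
    A communication-avoiding matrix powers kernel computes these s SpMVs
    in one pass over A; this serial loop only shows what is computed."""
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for k in range(s):
        V[:, k + 1] = A @ V[:, k]
    return V

def newton_basis(A, v, shifts):
    """Newton basis: v_{k+1} = (A - theta_k I) v_k. With good shifts
    (e.g. Leja-ordered Ritz values) this is far better conditioned than
    the monomial basis, which tends toward the dominant eigenvector."""
    V = np.empty((len(v), len(shifts) + 1))
    V[:, 0] = v
    for k, theta in enumerate(shifts):
        V[:, k + 1] = A @ V[:, k] - theta * V[:, k]
    return V
```

Both bases span the same Krylov subspace; the choice affects only conditioning and, as noted in contribution (7), the formulation of the Krylov method built on top of them.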
We first reported performance results for this implementation on the Intel Clovertown platform in Demmel et al. [78].

(10) We have incorporated preconditioning into our methods. Note that we have not yet developed practical communication-avoiding preconditioners; this is future work. We have accomplished the following: (a) We show (in Sections 2.2 and 4.3) what the s-step basis should compute in the preconditioned case for many different types of Krylov methods and s-step bases. We explain why this is hard in Section 4.3. (b) We have identified two different structures that a preconditioner may have, in order to achieve the desired optimal reduction of communication by a factor of s. See Section 2.2 for details.

(Abstract shortened by UMI.)
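For point (10a), the left-preconditioned case replaces powers of A with powers of M⁻¹A. A minimal sketch of what the preconditioned s-step basis computes, assuming NumPy and a dense direct solve standing in for the preconditioner application (the function name is ours):

```python
import numpy as np

def left_preconditioned_basis(A, M, v, s):
    """Basis of the left-preconditioned Krylov subspace
    span{v, (M^-1 A)v, ..., (M^-1 A)^s v}. Each step is an SpMV followed
    by a preconditioner solve; here the solve is dense for illustration.
    A communication-avoiding kernel must fuse all s such steps, which
    requires exploitable structure in M (see Section 2.2 of the thesis)."""
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for k in range(s):
        V[:, k + 1] = np.linalg.solve(M, A @ V[:, k])
    return V
```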