Data communication, both within the memory system of a single processor node and between nodes of a parallel machine, is the bottleneck in many iterative sparse matrix solvers such as CG and GMRES: k iterations of a conventional implementation perform k sparse matrix-vector multiplications (SpMVs) and Ω(k) vector operations such as dot products, so communication grows by a factor of Ω(k) in both the memory hierarchy and the network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once, and restructuring the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and by reading the matrix A from DRAM to cache just once instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown achieves speedups of up to 4.3x over standard GMRES, without sacrificing convergence rate or numerical stability.
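The reorganized sparse-matrix kernel described above produces, from a starting vector x, the k products x, Ax, A²x, ..., Aᵏx that the rest of the algorithm consumes. The following is a minimal sketch of that kernel's interface using SciPy; the function name `matrix_powers` and the example matrix are illustrative assumptions, and the body shows only the mathematical result, not the blocked, communication-avoiding implementation the paper describes.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Return the Krylov basis [x, A@x, A^2@x, ..., A^k@x] as columns of V.

    A conventional solver interleaves these k SpMVs with dot products,
    reading A from slow memory k times; the communication-avoiding kernel
    computes the same k products in a single pass over A. This sketch
    only captures the interface and result, not the blocking.
    """
    V = np.empty((len(x), k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]  # naive: one full sweep over A per step
    return V

# Hypothetical example: 1D Poisson (tridiagonal) matrix, k = 3
n = 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
x = np.ones(n)
V = matrix_powers(A, x, 3)  # V has shape (8, 4)
```

In the communication-avoiding variant, each cache block (sequential case) or processor (parallel case) first gathers the rows of A and entries of x it will need for all k levels, then computes its part of every column of V without further communication.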