Data communication, both within the memory system of a single processor node and between nodes of a parallel machine, is the bottleneck in many iterative sparse matrix solvers such as CG and GMRES: k iterations of a conventional implementation perform k sparse matrix-vector multiplications (SpMVs) and Ω(k) vector operations such as dot products, so communication grows by a factor of Ω(k) in both the memory hierarchy and the network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once, and restructuring the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and by reading the matrix A from DRAM to cache just once instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown achieves speedups of up to 4.3x over standard GMRES, without sacrificing convergence rate or numerical stability.
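The reorganized sparse-matrix kernel described above produces, from a starting vector x, the k products x, Ax, A²x, ..., Aᵏx that the rest of the algorithm consumes. The following is a minimal sketch of that kernel's interface using SciPy; the function name `matrix_powers` and the example matrix are illustrative assumptions, and the body shows only the mathematical result, not the blocked, communication-avoiding implementation the paper describes.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Return the Krylov basis [x, A@x, A^2@x, ..., A^k@x] as columns of V.

    A conventional solver interleaves these k SpMVs with dot products,
    reading A from slow memory k times; the communication-avoiding kernel
    computes the same k products in a single pass over A. This sketch
    only captures the interface and result, not the blocking.
    """
    V = np.empty((len(x), k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]  # naive: one full sweep over A per step
    return V

# Hypothetical example: 1D Poisson (tridiagonal) matrix, k = 3
n = 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
x = np.ones(n)
V = matrix_powers(A, x, 3)  # V has shape (8, 4)
```

In the communication-avoiding variant, each cache block (sequential case) or processor (parallel case) first gathers the rows of A and entries of x it will need for all k levels, then computes its part of every column of V without further communication.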