The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative cost of communication is growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we give an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2x speedup over Intel MKL while performing 43% more floating-point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5x scaling over our serial code and up to 6x speedup over the PLASMA library on a ten-core processor.
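To make the underlying machinery concrete, the following is a minimal sketch of tridiagonalization via Givens similarity rotations, the building block that annihilate-and-chase procedures apply repeatedly. This is an illustrative dense-matrix version only: it does not implement the paper's communication-avoiding band algorithm, and unlike a true band reduction it makes no attempt to confine fill-in (bulges) to the band. All function names here are ours, not the authors'.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def tridiagonalize(A):
    """Reduce a symmetric matrix to tridiagonal form by orthogonal
    (Givens) similarity transformations, preserving its eigenvalues.

    Dense-matrix sketch: a band algorithm would instead chase each
    bulge down the band to preserve sparsity.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for j in range(n - 2):
        for i in range(j + 2, n):
            c, s = givens(A[j + 1, j], A[i, j])
            # Left rotation: mix rows j+1 and i, zeroing A[i, j].
            r1, r2 = A[j + 1, :].copy(), A[i, :].copy()
            A[j + 1, :] = c * r1 + s * r2
            A[i, :] = -s * r1 + c * r2
            # Right rotation (its transpose): mix columns j+1 and i,
            # restoring symmetry and zeroing A[j, i].
            c1, c2 = A[:, j + 1].copy(), A[:, i].copy()
            A[:, j + 1] = c * c1 + s * c2
            A[:, i] = -s * c1 + c * c2
    return A
```

In a band algorithm, each such rotation applied near the band edge creates a bulge just outside the band, which must be chased toward the bottom-right corner by further rotations; the paper's contribution is reorganizing that chasing so the working set stays in fast memory.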