The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative cost of communication is growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we give an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2x speedup over Intel MKL while performing 43% more floating-point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5x scaling over our serial code and up to 6x speedup over the PLASMA library on a ten-core processor.
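To make the underlying machinery concrete, the following is a minimal sketch of tridiagonalization via Givens similarity rotations, the building block that annihilate-and-chase procedures apply repeatedly. This is an illustrative dense-matrix version only: it does not implement the paper's communication-avoiding band algorithm, and unlike a true band reduction it makes no attempt to confine fill-in (bulges) to the band. All function names here are ours, not the authors'.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def tridiagonalize(A):
    """Reduce a symmetric matrix to tridiagonal form by orthogonal
    (Givens) similarity transformations, preserving its eigenvalues.

    Dense-matrix sketch: a band algorithm would instead chase each
    bulge down the band to preserve sparsity.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for j in range(n - 2):
        for i in range(j + 2, n):
            c, s = givens(A[j + 1, j], A[i, j])
            # Left rotation: mix rows j+1 and i, zeroing A[i, j].
            r1, r2 = A[j + 1, :].copy(), A[i, :].copy()
            A[j + 1, :] = c * r1 + s * r2
            A[i, :] = -s * r1 + c * r2
            # Right rotation (its transpose): mix columns j+1 and i,
            # restoring symmetry and zeroing A[j, i].
            c1, c2 = A[:, j + 1].copy(), A[:, i].copy()
            A[:, j + 1] = c * c1 + s * c2
            A[:, i] = -s * c1 + c * c2
    return A
```

In a band algorithm, each such rotation applied near the band edge creates a bulge just outside the band, which must be chased toward the bottom-right corner by further rotations; the paper's contribution is reorganizing that chasing so the working set stays in fast memory.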