Uncertainty quantification in risk analysis has become a key application. In this context, computing the diagonal of inverse covariance matrices is of paramount importance. Standard techniques, which employ matrix factorizations, incur a cubic cost that quickly becomes intractable with the current explosion of data sizes. In this work we reduce this complexity to quadratic through the synergy of two algorithms that complement each other and lead to a radically different approach. First, we turn to stochastic estimation of the diagonal, which casts the problem as a linear system with a relatively small number of right-hand sides. Second, for this linear system we develop a novel mixed precision iterative refinement scheme that uses iterative solvers instead of matrix factorizations. We demonstrate that the new framework not only achieves the desired quadratic cost but also offers excellent opportunities for scaling in massively parallel environments. We based our implementation on BLAS 3 kernels that ensure very high processor performance. We achieved a peak performance of 730 TFlops on 72 BG/P racks, sustaining 73% of the theoretical peak. We stress that the techniques presented in this work are quite general and applicable to several other important applications.
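The two ingredients described above can be sketched in a few lines. This is only an illustrative serial sketch, not the paper's parallel implementation: it uses the classical stochastic diagonal estimator (Rademacher probe vectors, element-wise ratio of accumulated products) and a textbook mixed precision refinement loop where the inner solve runs in single precision and the residual correction in double. Function names, the dense `np.linalg.solve` stand-in for an iterative solver, and all parameters are choices made for this sketch.

```python
import numpy as np

def stochastic_diag_inverse(A, num_samples=400, seed=None):
    """Estimate diag(A^{-1}) stochastically.

    Each Rademacher probe v contributes one sample v * (A^{-1} v);
    the element-wise ratio of the accumulated products estimates
    the diagonal. Accuracy improves like O(1/sqrt(num_samples)).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    num = np.zeros(n)
    den = np.zeros(n)
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        x = np.linalg.solve(A, v)            # stand-in for an iterative solve
        num += v * x
        den += v * v
    return num / den

def mixed_precision_solve(A, b, iters=5):
    """Solve A x = b by iterative refinement.

    The inner solves use single precision (cheap, fast); the residual
    is evaluated in double precision, so the refined solution recovers
    double-precision accuracy for well-conditioned A.
    """
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                        # double-precision residual
        d = np.linalg.solve(A32, r.astype(np.float32))       # cheap correction solve
        x += d.astype(np.float64)
    return x
```

In the framework described in the abstract, each probe vector becomes one column of a block right-hand side, so the solves are batched and mapped onto BLAS 3 kernels; the dense solve here would be replaced by a block iterative method.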