On the spectral decomposition of Hermitian matrices modified by low rank perturbations
SIAM Journal on Matrix Analysis and Applications
Accurate singular values of bidiagonal matrices
SIAM Journal on Scientific and Statistical Computing
LAPACK Users' Guide
Numerical solution of a secular equation
Numerische Mathematik
A numerical comparison of methods for solving secular equations
Journal of Computational and Applied Mathematics - Special issue: dedicated to William B. Gragg on the occasion of his 60th Birthday
Solving secular equations stably and efficiently
LAPACK Working Note 88: Efficient Computation of the Singular Value Decomposition with Applications to Least Squares Problems
Computer Architecture, Fourth Edition: A Quantitative Approach
Singular value decomposition on GPU using CUDA
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Communication-Avoiding QR Decomposition for GPUs
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Singular value decomposition (SVD) is a fundamental linear algebra operation used in many applications, such as pattern recognition and statistical information processing. To accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for computing the SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performance solution of the secular equation with good numerical stability, overlapping of CPU and GPU tasks, and efficient use of the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm outperforms MKL's divide-and-conquer routine [18] running on four cores (eight hardware threads) when the size of the input matrix is larger than 3000. Furthermore, as the matrix size grows to 14,000, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device. Our algorithm is also much faster than previous SVD approaches on GPUs.
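For readers unfamiliar with the secular equation mentioned above, the following is a minimal sketch and not the paper's solver. It assumes the standard form from the divide-and-conquer bidiagonal SVD literature, f(w) = 1 + sum_k z_k^2 / (d_k^2 - w^2) = 0 with 0 = d_0 < d_1 < ... < d_{n-1} and all z_k nonzero, and it finds the one root in each interlacing interval by plain bisection. The function names `secular` and `secular_roots` and the sample data are hypothetical; production solvers use faster rational-interpolation iterations, and nothing here reflects the GPU implementation.

```python
import math

def secular(omega, d, z):
    """Evaluate f(omega) = 1 + sum_k z_k^2 / (d_k^2 - omega^2).

    The denominator is written as (d_k - omega) * (d_k + omega) to limit
    cancellation when omega is close to d_k.
    """
    return 1.0 + sum(zk * zk / ((dk - omega) * (dk + omega))
                     for dk, zk in zip(d, z))

def secular_roots(d, z, iters=200):
    """Return one root per interval (d_i, d_{i+1}) using bisection.

    Assumes d is strictly increasing with d[0] == 0 and every z_k != 0.
    The last root lies in (d[-1], d[-1] + ||z||_2).
    """
    n = len(d)
    upper = d[-1] + math.sqrt(sum(zk * zk for zk in z))
    roots = []
    for i in range(n):
        lo = d[i]
        hi = d[i + 1] if i + 1 < n else upper
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:
                break  # interval can no longer be split in floating point
            # f increases from -inf to a nonnegative value on the interval
            if secular(mid, d, z) < 0.0:
                lo = mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return roots

# Illustrative data only (not taken from the paper).
d = [0.0, 1.0, 2.0, 3.0]
z = [0.5, 0.4, 0.3, 0.2]
omegas = secular_roots(d, z)
```

Under these assumptions the roots are the singular values of the matrix whose first row is z and whose remaining diagonal entries are d_1, ..., d_{n-1}, so the sum of their squares should match the sum of squares of d and z. Each interval is independent of the others, which is one reason this step parallelizes well.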