A novel scheme for the parallel computation of SVDs
HPCC'06 Proceedings of the Second International Conference on High Performance Computing and Communications
The performance of many applications on modern computers is often limited by memory latency rather than by processor speed. On computers with a memory hierarchy, it is preferable to perform the computation on blocks of data, reusing the loaded data in cache to reduce the impact of memory latency. This paper proposes a fast parallel algorithm, based on one-sided Jacobi, for computing the singular value decomposition (SVD) on architectures with a multi-level memory hierarchy. On P parallel processors, the given matrix is divided into super-rows, which are then partitioned into 2P blocks. One key point of the proposed algorithm is its thorough exploitation of the memory hierarchy: all computations are performed on super-rows loaded into cache memory rather than on individual rows. Another key point is that the number of sweeps required for convergence is very close to that of cyclic one-sided Jacobi. A third key point is that the number of sweeps required for convergence does not depend strongly on the size of the input matrix. On two dual-core Intel Xeon processors, our results show that the parallel implementation of the proposed algorithm runs around 11 times faster than the sequential implementation on the same hardware. Moreover, a performance of around 10 GFLOPS (double precision) can be achieved on the target system using multi-threading, Intel SIMD instructions, matrix blocking, and loop unrolling.
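To make the baseline concrete, the following is a minimal sketch of the plain cyclic one-sided (Hestenes) Jacobi SVD that the paper's blocked, parallel variant builds on. It is not the paper's algorithm: the super-row partitioning, cache blocking, multi-threading, and SIMD optimizations described in the abstract are all omitted, and the function name and tolerance parameter are illustrative choices.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Cyclic one-sided Jacobi SVD sketch (no blocking or parallelism).

    Repeatedly applies plane rotations to column pairs of A until all
    columns are mutually orthogonal; the column norms are then the
    singular values. Returns (U, sigma, V) with A = U @ diag(sigma) @ V.T.
    """
    U = np.array(A, dtype=float, copy=True)
    m, n = U.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                # 2x2 Gram-matrix entries for columns i and j.
                a = U[:, i] @ U[:, i]
                b = U[:, j] @ U[:, j]
                c = U[:, i] @ U[:, j]
                if abs(c) <= tol * np.sqrt(a * b):
                    continue  # this pair is already orthogonal enough
                converged = False
                # Rotation angle that zeroes the off-diagonal entry c.
                zeta = (b - a) / (2.0 * c)
                t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                cs = 1.0 / np.sqrt(1.0 + t * t)
                sn = cs * t
                J = np.array([[cs, sn], [-sn, cs]])
                U[:, [i, j]] = U[:, [i, j]] @ J  # rotate the two columns
                V[:, [i, j]] = V[:, [i, j]] @ J  # accumulate right vectors
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    U = U / sigma  # normalize columns to obtain left singular vectors
    return U, sigma, V
```

Because each rotation touches only two columns, the pair sweeps parallelize naturally; the abstract's super-row scheme additionally keeps the working set of each pair resident in cache.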