A novel scheme for the parallel computation of SVDs
HPCC'06 Proceedings of the Second International Conference on High Performance Computing and Communications
The performance of many applications on modern computers is often limited by memory latency rather than by processor speed. On computers with a memory hierarchy, it is preferable to perform the computation on blocks of data, reusing the loaded data in cache to reduce the impact of memory latency. This paper proposes a fast parallel algorithm, based on one-sided Jacobi, for computing the singular value decomposition (SVD) on architectures with a multi-level memory hierarchy. On P parallel processors, the given matrix is divided into super-rows, which are then partitioned into 2P blocks. One key point of the proposed algorithm is its thorough exploitation of the memory hierarchy: all computations are performed on super-rows loaded into cache memory rather than on individual rows. Another key point is that the number of sweeps required for convergence is very close to that of cyclic one-sided Jacobi. A third key point is that the number of sweeps required for convergence does not depend strongly on the size of the input matrix. On two dual-core Intel Xeon processors, our results show that the parallel implementation of the proposed algorithm runs around 11 times faster than the sequential implementation on the same hardware. Moreover, a performance of around 10 GFLOPS (double precision) can be achieved on the target system using multi-threading, Intel SIMD instructions, matrix blocking, and loop unrolling.
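To make the baseline concrete, the following is a minimal sketch of the plain cyclic one-sided (Hestenes) Jacobi SVD that the paper's blocked, parallel variant builds on. It is not the paper's algorithm: the super-row partitioning, cache blocking, multi-threading, and SIMD optimizations described in the abstract are all omitted, and the function name and tolerance parameter are illustrative choices.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Cyclic one-sided Jacobi SVD sketch (no blocking or parallelism).

    Repeatedly applies plane rotations to column pairs of A until all
    columns are mutually orthogonal; the column norms are then the
    singular values. Returns (U, sigma, V) with A = U @ diag(sigma) @ V.T.
    """
    U = np.array(A, dtype=float, copy=True)
    m, n = U.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                # 2x2 Gram-matrix entries for columns i and j.
                a = U[:, i] @ U[:, i]
                b = U[:, j] @ U[:, j]
                c = U[:, i] @ U[:, j]
                if abs(c) <= tol * np.sqrt(a * b):
                    continue  # this pair is already orthogonal enough
                converged = False
                # Rotation angle that zeroes the off-diagonal entry c.
                zeta = (b - a) / (2.0 * c)
                t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                cs = 1.0 / np.sqrt(1.0 + t * t)
                sn = cs * t
                J = np.array([[cs, sn], [-sn, cs]])
                U[:, [i, j]] = U[:, [i, j]] @ J  # rotate the two columns
                V[:, [i, j]] = V[:, [i, j]] @ J  # accumulate right vectors
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    U = U / sigma  # normalize columns to obtain left singular vectors
    return U, sigma, V
```

Because each rotation touches only two columns, the pair sweeps parallelize naturally; the abstract's super-row scheme additionally keeps the working set of each pair resident in cache.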