Macrotasking the singular value decomposition of block circulant matrices on the Cray-2

Authors:
J. R. Baker
Affiliations:
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley and Research Medicine and Radiation Biophysics Division, Lawrence Berkeley Laboratory
Venue:
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Year:
1989

Abstract

A parallel algorithm to compute the singular value decomposition (SVD) of block circulant matrices on the Cray-2 is described. For a block circulant form described by M blocks with m × n elements in each block, the computation time using an SVD algorithm for general matrices has a lower bound Ω(M^3 min(m, n) mn). Using a combination of fast Fourier transform (FFT) and SVD steps, the computation time for block circulant singular value decomposition (BCSVD) has a lower bound Ω(M min(m, n) mn), a relative savings of ~M^2. Memory usage bounds are reduced from Θ(M^2 mn) to Θ(M mn), a relative savings of ~M. For M = m = n = 64, this decreases the computation time from approximately 12 hours to 30 seconds, and memory usage is reduced from 768 megabytes to 12 megabytes. The BCSVD algorithm partitions well into n macrotasks with a granularity of Θ(mM log M) for the FFT portion of the algorithm. The SVD portion of the algorithm partitions into M macrotasks with a granularity of Ω(min(m, n) mn). Again, for the case where M = m = n = 64, the FFT granularity is 29 ms and the SVD granularity is 428 ms. A speedup of 3.06 was achieved by using a prescheduled partitioning of tasks. The process creation overhead was 2.63 ms. Using a more elaborate self-scheduling method with four synchronizing server processes, a speedup of 3.25 was observed with four processors available. The server synchronization overhead was 0.32 ms. Relative memory overhead in both cases was about 4% for data space and 40% for code space.
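
The FFT-then-SVD structure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's Cray-2 implementation and does no macrotasking. The function names `block_circulant` and `bcsvd_singular_values`, the block layout (block (i, j) equals blocks[(i - j) mod M]), and the use of NumPy are all assumptions made for illustration. The sketch relies on the standard fact that a block circulant matrix is block-diagonalized by a unitary discrete Fourier transform taken across the block index, so its singular values are the union of the singular values of the M transformed m × n blocks.

```python
import numpy as np


def block_circulant(blocks):
    """Assemble the dense (M*m) x (M*n) block circulant matrix.

    blocks has shape (M, m, n). Assumed layout (illustrative only):
    block (i, j) of the result is blocks[(i - j) mod M].
    """
    M, m, n = blocks.shape
    C = np.zeros((M * m, M * n), dtype=blocks.dtype)
    for i in range(M):
        for j in range(M):
            C[i * m:(i + 1) * m, j * n:(j + 1) * n] = blocks[(i - j) % M]
    return C


def bcsvd_singular_values(blocks):
    """Singular values of the block circulant matrix without forming it.

    Returns an array of shape (M, min(m, n)): row k holds the singular
    values of the k-th Fourier-domain block.
    """
    # FFT step: a length-M FFT along the block index for each of the
    # m*n element positions. Grouping these by column would give n
    # independent units of work, each with m FFTs of length M.
    spectral = np.fft.fft(blocks, axis=0)

    # SVD step: M independent m x n SVDs, one per Fourier-domain block.
    # np.linalg.svd accepts a stack of matrices and processes each one.
    return np.linalg.svd(spectral, compute_uv=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = rng.standard_normal((4, 3, 2))  # M = 4, m = 3, n = 2

    fast = np.sort(bcsvd_singular_values(blocks).ravel())
    dense = np.sort(np.linalg.svd(block_circulant(blocks), compute_uv=False))
    print(np.allclose(fast, dense))
```

In this sketch the dense route factors one (M·m) × (M·n) matrix, while the FFT route only ever stores the M blocks and factors M small matrices, which mirrors the time and memory reductions the abstract reports. Both loops over independent units (FFTs per column, SVDs per Fourier-domain block) are the natural places where the work could be split across processors.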