Matrix factorization (often called matrix decomposition) is a frequently used kernel in a large number of applications, ranging from linear solvers to data clustering and machine learning. The central contribution of this paper is a thorough performance study of four popular matrix factorization techniques, namely LU, Cholesky, QR, and SVD, on the STI Cell Broadband Engine; Cholesky is evaluated in both dense and sparse form, yielding five factorization routines in all. The paper explores algorithmic as well as implementation challenges related to the Cell chip-multiprocessor and explains how we achieve near-linear speedup on most of the factorization techniques for a range of matrix sizes. For each of the factorization routines, we identify the bottleneck kernels, explain how we have attempted to resolve each bottleneck, and report to what extent we have been successful. For the largest data sets that we use, our implementations, running on a two-node 3.2 GHz Cell BladeCenter (exercising a total of sixteen SPEs), deliver on average 203.9, 284.6, 81.5, 243.9, and 54.0 GFLOPS for dense LU, dense Cholesky, sparse Cholesky, QR, and SVD, respectively. On sixteen SPEs, these implementations achieve speedups of 11.2, 12.8, 10.6, 13.0, and 6.2, respectively. Finally, we discuss the interesting interactions that result from parallelizing the factorization routines on a two-node non-uniform memory access (NUMA) Cell Blade cluster.
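To illustrate the kind of tile-based decomposition that such Cell implementations typically build on, here is a minimal sketch (not the paper's code) of a right-looking blocked Cholesky in plain Python. Each per-tile kernel — the diagonal-tile factorization (POTRF-style), the panel triangular solve (TRSM-style), and the trailing-matrix update (SYRK/GEMM-style) — is the natural unit of work that a Cell-style implementation would dispatch to an SPE; here the kernels run serially for clarity, and the function name and tile size parameter are illustrative choices.

```python
import math

def cholesky_tiled(A, b):
    """In-place lower Cholesky factor of a symmetric positive-definite
    matrix A (list of lists), processed in b-by-b tiles. Only the lower
    triangle of A is read and overwritten; L satisfies A = L * L^T."""
    n = len(A)
    for k in range(0, n, b):
        kb = min(b, n - k)
        # POTRF-style kernel: unblocked Cholesky of the diagonal tile.
        for j in range(k, k + kb):
            A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k, j)))
            for i in range(j + 1, k + kb):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p]
                                         for p in range(k, j))) / A[j][j]
        # TRSM-style kernel: solve the panel rows against L_kk^T.
        for i in range(k + kb, n):
            for j in range(k, k + kb):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p]
                                         for p in range(k, j))) / A[j][j]
        # SYRK/GEMM-style kernel: rank-kb update of the trailing submatrix.
        for i in range(k + kb, n):
            for j in range(k + kb, i + 1):
                A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k, k + kb))
    return A
```

Because each inner kernel touches only a small, contiguous set of tiles, a parallel version can stream those tiles through the SPEs' local stores via DMA and overlap the independent TRSM and update kernels across SPEs, which is the basic source of the near-linear speedups reported above.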