Matrix factorization (often called matrix decomposition) is a frequently used kernel in a large number of applications, ranging from linear solvers to data clustering and machine learning. The central contribution of this paper is a thorough performance study of four popular matrix factorization techniques, namely LU, Cholesky, QR, and SVD, on the STI Cell Broadband Engine; Cholesky is evaluated in both dense and sparse form, yielding five factorization routines in all. The paper explores algorithmic as well as implementation challenges related to the Cell chip-multiprocessor and explains how we achieve near-linear speedup on most of the factorization techniques for a range of matrix sizes. For each of the factorization routines, we identify the bottleneck kernels, explain how we have attempted to resolve each bottleneck, and report to what extent we have been successful. For the largest data sets that we use, our implementations, running on a two-node 3.2 GHz Cell BladeCenter (exercising a total of sixteen SPEs), deliver on average 203.9, 284.6, 81.5, 243.9, and 54.0 GFLOPS for dense LU, dense Cholesky, sparse Cholesky, QR, and SVD, respectively. On sixteen SPEs, these implementations achieve speedups of 11.2, 12.8, 10.6, 13.0, and 6.2, respectively. Finally, we discuss the interesting interactions that result from parallelizing the factorization routines on a two-node non-uniform memory access (NUMA) Cell Blade cluster.
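To illustrate the kind of tile-based decomposition that such Cell implementations typically build on, here is a minimal sketch (not the paper's code) of a right-looking blocked Cholesky in plain Python. Each per-tile kernel — the diagonal-tile factorization (POTRF-style), the panel triangular solve (TRSM-style), and the trailing-matrix update (SYRK/GEMM-style) — is the natural unit of work that a Cell-style implementation would dispatch to an SPE; here the kernels run serially for clarity, and the function name and tile size parameter are illustrative choices.

```python
import math

def cholesky_tiled(A, b):
    """In-place lower Cholesky factor of a symmetric positive-definite
    matrix A (list of lists), processed in b-by-b tiles. Only the lower
    triangle of A is read and overwritten; L satisfies A = L * L^T."""
    n = len(A)
    for k in range(0, n, b):
        kb = min(b, n - k)
        # POTRF-style kernel: unblocked Cholesky of the diagonal tile.
        for j in range(k, k + kb):
            A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k, j)))
            for i in range(j + 1, k + kb):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p]
                                         for p in range(k, j))) / A[j][j]
        # TRSM-style kernel: solve the panel rows against L_kk^T.
        for i in range(k + kb, n):
            for j in range(k, k + kb):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p]
                                         for p in range(k, j))) / A[j][j]
        # SYRK/GEMM-style kernel: rank-kb update of the trailing submatrix.
        for i in range(k + kb, n):
            for j in range(k + kb, i + 1):
                A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k, k + kb))
    return A
```

Because each inner kernel touches only a small, contiguous set of tiles, a parallel version can stream those tiles through the SPEs' local stores via DMA and overlap the independent TRSM and update kernels across SPEs, which is the basic source of the near-linear speedups reported above.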