Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization

Authors:
Jakub Kurzak;Alfredo Buttari;Jack Dongarra
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2008

Citing 0
Cited 23

A class of parallel tiled linear algebra algorithms for multicore architectures
Parallel Computing

QR factorization for the Cell Broadband Engine
Scientific Programming - High Performance Computing with the Cell Broadband Engine

Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor
Parallel Computing

Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System
SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation

Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices
ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Scheduling two-sided transformations using tile algorithms on multicore architectures
Scientific Programming

High Resolution Program Flow Visualization of Hardware Accelerated Hybrid Multi-core Applications
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing

Massive video-surveillance parallelization on the cell broadband engine processor
IBM Journal of Research and Development

Implementation and performance analysis of parallel conjugate gradient on the cell broadband engine
IBM Journal of Research and Development

A scalable high performant Cholesky factorization for multicore with GPU accelerators
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science

A performance evaluation on monte carlo simulation for radiation dosimetry using cell processor
Journal of Computational Methods in Sciences and Engineering

Topological decomposition algorithm for optimized solution of a system of linear equations
Proceedings of the 2012 Joint International Conference on Human-Centered Computer Environments

Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture
Parallel Computing

Cache blocking
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I

Parallelization of pagerank on multicore processors
ICDCIT'12 Proceedings of the 8th international conference on Distributed Computing and Internet Technology

Parallelization and performance comparison of the conjugate gradient equation solver on multicore Cell and Xeon computers
Concurrency and Computation: Practice & Experience

Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures
Concurrency and Computation: Practice & Experience

New level-3 BLAS kernels for cholesky factorization
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I

Cache blocking for linear algebra algorithms
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I

Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
ACM Transactions on Mathematical Software (TOMS)

An improved parallel singular value algorithm and its implementation for multicore hardware
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

An (almost) direct deployment of the Fast Multipole Method on the Cell processor
The Journal of Supercomputing

Abstract

The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time, it presents new challenges for the development of numerical algorithms. One is effective exploitation of the differential between the speed of single- and double-precision arithmetic; the other is efficient parallelization across the short-vector SIMD cores. The first challenge is addressed by applying the well-known technique of iterative refinement to the solution of a dense symmetric positive definite system of linear equations, resulting in a mixed-precision algorithm that delivers double-precision accuracy while performing the bulk of the work in single precision. The main contribution of this paper lies in addressing the second challenge through successful thread-level parallelization, exploiting fine-grained task granularity and lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90 percent of peak floating-point performance, while the implementation of the memory-intensive sections reaches within 90 percent of peak memory bandwidth. On a single CELL processor, the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double-precision accuracy.
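
The mixed-precision idea in the abstract can be illustrated with a minimal sketch. This is not the paper's CELL implementation: it is plain, unblocked, serial Python. It assumes single precision can be emulated by rounding every arithmetic result through the standard `struct` module. The function names (`f32`, `cholesky_f32`, `solve_f32`, `residual`, `refine_solve`), the tolerance, and the iteration cap are hypothetical choices made for illustration. The factorization and triangular solves run in emulated single precision, while the residual and the solution update are kept in double precision.

```python
import math
import struct


def f32(x):
    """Round a Python float (IEEE double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cholesky_f32(a):
    """Return lower-triangular L with A ~= L L^T, rounding each operation to single."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = f32(a[j][j])
        for k in range(j):
            s = f32(s - f32(l[j][k] * l[j][k]))
        l[j][j] = f32(math.sqrt(s))
        for i in range(j + 1, n):
            s = f32(a[i][j])
            for k in range(j):
                s = f32(s - f32(l[i][k] * l[j][k]))
            l[i][j] = f32(s / l[j][j])
    return l


def solve_f32(l, b):
    """Solve L L^T x = b by forward and back substitution in emulated single precision."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):
        s = f32(b[i])
        for k in range(i):
            s = f32(s - f32(l[i][k] * y[k]))
        y[i] = f32(s / l[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):
        s = y[i]
        for k in range(i + 1, n):
            s = f32(s - f32(l[k][i] * x[k]))
        x[i] = f32(s / l[i][i])
    return x


def residual(a, x, b):
    """Compute r = b - A x in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine_solve(a, b, max_iter=20, tol=1e-14):
    """Mixed-precision solve: single-precision factorization, double-precision refinement.

    max_iter and tol are illustrative defaults, not values from the paper.
    """
    l = cholesky_f32(a)          # O(n^3) work in (emulated) single precision
    x = solve_f32(l, b)          # initial solution, single-precision accuracy
    b_norm = max(abs(v) for v in b)
    for _ in range(max_iter):
        r = residual(a, x, b)    # O(n^2) work in double precision
        if max(abs(v) for v in r) <= tol * b_norm:
            break
        d = solve_f32(l, r)      # correction reuses the single-precision factor
        x = [xi + di for xi, di in zip(x, d)]  # update accumulated in double
    return x


if __name__ == "__main__":
    a = [[4.1, 1.3, 2.2], [1.3, 3.7, 0.4], [2.2, 0.4, 5.9]]
    x_true = [1.0, -2.0, 3.0]
    b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in a]
    x = refine_solve(a, b)
    print("max error:", max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

The design point the sketch tries to show is the cost split: the cubic-cost factorization is done once in the faster precision, and each refinement step adds only quadratic-cost work (one residual and two triangular solves). Refinement of this kind is generally expected to converge only when the matrix is not too ill-conditioned relative to single precision.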