Parallel direct methods for solving the system of linear equations with pipelining on a multicore using OpenMP

Authors:
Panagiotis D. Michailidis; Konstantinos G. Margaritis
Affiliations:
Department of Balkan Studies, University of Western Macedonia, Florina, Greece; Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
Venue:
Journal of Computational and Applied Mathematics
Year:
2011

Abstract

Recent developments in high-performance computer architecture have had a significant effect on all fields of scientific computing. Linear algebra, and especially the solution of systems of linear equations, lies at the heart of many scientific computing applications. This paper describes and analyzes three parallel versions of dense direct methods for solving linear systems, namely Gaussian elimination and its LU form, on a multicore processor using the OpenMP interface. More specifically, we present two naive parallel algorithms based on row block and row cyclic data distributions, and we place special emphasis on a third parallel algorithm based on the pipeline technique. We also propose an implementation of the pipelining technique in OpenMP. Experimental results on a multicore CPU show that the proposed OpenMP pipeline implementation achieves good overall performance compared with the two naive parallel methods. Finally, we propose a simple, fast, and reasonably accurate analytical model to predict the performance of the direct methods with the pipelining technique.
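The abstract names three OpenMP variants (row block, row cyclic, pipelined) without giving code. Below is a minimal C sketch of how such variants might look, under our own assumptions rather than the authors' implementation: no pivoting (so A should be, for example, diagonally dominant), an augmented row-major n×(n+1) matrix [A | b], and hypothetical helper names (`ge_naive`, `ge_pipeline`, `wait_ready`, `signal_ready`, `back_substitute`, `AT`). Multipliers are stored in the strict lower triangle, so the reduced matrix holds the LU factors. The pipelined variant uses per-row "ready" flags with spin-waiting, which is one plausible way to express pipelining in OpenMP; the paper's own scheme may differ. Compile with `-fopenmp`.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also compiles without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Assumption: A is an n x (n+1) augmented matrix [A | b], row-major. */
#define AT(A, n, i, j) ((A)[(size_t)(i) * ((n) + 1) + (j)])

/* Eliminate column k of row i with pivot row k. The multiplier is kept in
   A[i][k], so the strict lower triangle ends up holding L (LU form).
   No pivoting in this sketch: a zero pivot is treated as an error. */
static void eliminate_row(double *A, int n, int k, int i) {
    assert(AT(A, n, k, k) != 0.0);
    double m = AT(A, n, i, k) / AT(A, n, k, k);
    AT(A, n, i, k) = m;
    for (int j = k + 1; j <= n; j++)
        AT(A, n, i, j) -= m * AT(A, n, k, j);
}

/* Naive variants: one worksharing loop per pivot step, ended by an implicit
   barrier. Looping over all n rows with a static schedule keeps each row on
   the same thread at every step: cyclic = 0 gives contiguous row blocks,
   cyclic = 1 deals rows out round-robin. */
void ge_naive(double *A, int n, int cyclic) {
    #pragma omp parallel
    {
        for (int k = 0; k < n - 1; k++) {
            if (cyclic) {
                #pragma omp for schedule(static, 1)
                for (int i = 0; i < n; i++) if (i > k) eliminate_row(A, n, k, i);
            } else {
                #pragma omp for schedule(static)
                for (int i = 0; i < n; i++) if (i > k) eliminate_row(A, n, k, i);
            }
        }
    }
}

/* Publish finished row i, then raise its flag. */
static void signal_ready(int *ready, int i) {
    #pragma omp flush
    #pragma omp atomic write seq_cst
    ready[i] = 1;
}

/* Spin until row k is final, then make its data visible to this thread. */
static void wait_ready(int *ready, int k) {
    int v = 0;
    while (!v) {
        #pragma omp atomic read seq_cst
        v = ready[k];
    }
    #pragma omp flush
}

/* Pipelined variant (hypothetical sketch): thread t owns rows t, t+p, ...
   ready[k] means row k has received all updates from pivots 0..k-1. The
   owner of row k+1 updates it first and signals at once, so other threads
   can start step k+1 while step k is still finishing elsewhere. */
void ge_pipeline(double *A, int n) {
    int *ready = calloc((size_t)n, sizeof *ready);
    assert(ready != NULL);
    ready[0] = 1;                       /* row 0 is never modified */
    #pragma omp parallel
    {
        int t = omp_get_thread_num(), p = omp_get_num_threads();
        for (int k = 0; k < n - 1; k++) {
            /* smallest row index > k owned by thread t */
            int first = k + 1 + ((t - (k + 1)) % p + p) % p;
            if (first >= n) break;      /* no owned rows left below k */
            wait_ready(ready, k);
            for (int i = first; i < n; i += p) {
                eliminate_row(A, n, k, i);
                if (i == k + 1) signal_ready(ready, i);
            }
        }
    }
    free(ready);
}

/* Back substitution on the reduced upper-triangular system. */
void back_substitute(const double *A, int n, double *x) {
    for (int i = n - 1; i >= 0; i--) {
        double s = AT(A, n, i, n);
        for (int j = i + 1; j < n; j++) s -= AT(A, n, i, j) * x[j];
        x[i] = s / AT(A, n, i, i);
    }
}
```

The design difference this sketch tries to show: the naive variants pay a full team barrier after every pivot step, while the pipelined variant replaces it with point-to-point flags, so a thread waits only for the one pivot row it needs. Spin-waiting assumes no more threads than cores; with oversubscription the pipeline still completes but wastes cycles.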