Parallel tiled QR factorization for multicore architectures

Authors:
Alfredo Buttari;Julien Langou;Jakub Kurzak;Jack Dongarra
Affiliations:
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, U.S.A.;Department of Mathematical Sciences, University of Colorado at Denver and Health Sciences Center, CO, U.S.A.;Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, U.S.A.;Dept. of Elec. Eng. and Comp. Sci., Univ. of Tennessee and Comp. Sci. and Math. Div., Oak Ridge National Laboratory, Oak Ridge, TN, U.S.A. and Univ. of Manchester, Manchester, U.K.
Venue:
Concurrency and Computation: Practice & Experience
Year:
2008

Abstract

As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data (referred to as ‘tiles’). These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks that will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can be exploited only at the level of the BLAS operations and with vendor implementations. Copyright © 2008 John Wiley & Sons, Ltd.
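
The task-based formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it performs no arithmetic and only enumerates the tasks of a tiled QR factorization, derives their dependencies from the tiles each task reads and writes, and simulates a greedy scheduler with unit-time tasks. The kernel labels (GEQRT, LARFB, TSQRT, SSRFB) are assumptions borrowed from LAPACK-style naming, since the abstract does not name the kernels, and the function names are hypothetical. Dependencies are tracked per whole tile, which is a conservative assumption.

```python
from collections import defaultdict

def tile_qr_tasks(p, q):
    """List (name, reads, writes) for a p x q grid of tiles, in sequential order.
    Kernel labels are assumed, LAPACK-style names; no numerical work is done."""
    tasks = []
    for k in range(min(p, q)):
        # Factor the diagonal tile (k, k).
        tasks.append((("GEQRT", k, k, k), [], [(k, k)]))
        # Apply that transformation to the tiles to its right.
        for j in range(k + 1, q):
            tasks.append((("LARFB", k, k, j), [(k, k)], [(k, j)]))
        for i in range(k + 1, p):
            # Couple the diagonal tile with tile (i, k) below it.
            tasks.append((("TSQRT", k, i, k), [], [(k, k), (i, k)]))
            # Update the corresponding pair of tiles in each trailing column.
            for j in range(k + 1, q):
                tasks.append((("SSRFB", k, i, j), [(i, k)], [(k, j), (i, j)]))
    return tasks

def build_dag(tasks):
    """Derive dependencies from tile accesses (whole-tile granularity)."""
    last_writer = {}
    readers = defaultdict(list)
    deps = {}
    for name, reads, writes in tasks:
        d = set()
        for t in reads:
            if t in last_writer:
                d.add(last_writer[t])   # read-after-write
        for t in writes:
            if t in last_writer:
                d.add(last_writer[t])   # write-after-write
            d.update(readers[t])        # write-after-read
        deps[name] = d
        for t in reads:
            readers[t].append(name)
        for t in writes:
            last_writer[t] = name
            readers[t] = []
    return deps

def schedule(deps, workers):
    """Greedy simulation: each step runs up to `workers` ready tasks."""
    done = set()
    pending = list(deps)  # insertion order is the sequential order
    steps = []
    while pending:
        ready = [t for t in pending if deps[t] <= done]
        batch = ready[:workers]
        steps.append(batch)
        done.update(batch)
        pending = [t for t in pending if t not in done]
    return steps

if __name__ == "__main__":
    tasks = tile_qr_tasks(4, 4)
    deps = build_dag(tasks)
    for w in (1, 2, 4):
        print(w, "workers:", len(schedule(deps, w)), "steps for", len(tasks), "tasks")
```

Because a task becomes ready as soon as the tasks it depends on have finished, the simulated scheduler may run tasks from a later step of the factorization before an earlier step has completed, which is the out-of-order execution the abstract refers to.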