Scheduling two-sided transformations using tile algorithms on multicore architectures

Authors:
Hatem Ltaief;Jakub Kurzak;Jack Dongarra;Rosa M. Badia
Affiliations:
Department of Electrical Engineering and Computer Science, University of Tennessee, TN, USA;Department of Electrical Engineering and Computer Science, University of Tennessee, TN, USA;(Corresponding author. E-mail: dongarra@eecs.utk.edu) Department of Electrical Engineering and Computer Science, University of Tennessee, TN, USA and Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN, USA and Sch. of Math. and ...;Barcelona Supercomputing Center - Centro Nacional de Supercomputación, Consejo Nacional de Investigaciones Cientificas, Barcelona, Spain
Venue:
Scientific Programming
Year:
2010

References (22)

A storage-efficient WY representation for products of Householder transformations. SIAM Journal on Scientific and Statistical Computing
Vector and parallel algorithms for Cholesky factorization on IBM 3090. Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Matrix computations (3rd ed.)
Solving Linear Algebraic Equations on an MIMD Computer. Journal of the ACM (JACM)
LAPACK Users' guide (third ed.)
A Parallel Algorithm for the Reduction to Tridiagonal Form for Eigendecomposition. SIAM Journal on Scientific Computing
New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems. PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms. Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
Analysis of Memory Hierarchy Performance of Block Data Layout. ICPP '02 Proceedings of the 2002 International Conference on Parallel Processing
Tiling, Block Data Layout, and Memory Hierarchy Performance. IEEE Transactions on Parallel and Distributed Systems
Parallel out-of-core computation and updating of the QR factorization. ACM Transactions on Mathematical Software (TOMS)
CellSs: a programming model for the Cell BE architecture. Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Block and Parallel Versions of One-Sided Bidiagonalization. SIAM Journal on Matrix Analysis and Applications
Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. PDP '08 Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)
CellSs: making it easier to program the Cell Broadband Engine processor. IBM Journal of Research and Development
Parallel tiled QR factorization for multicore architectures. Concurrency and Computation: Practice & Experience
Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization. IEEE Transactions on Parallel and Distributed Systems
A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing
QR factorization for the Cell Broadband Engine. Scientific Programming - High Performance Computing with the Cell Broadband Engine
Applying recursion to serial and parallel QR factorization leads to better performance. IBM Journal of Research and Development
Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Minimal data copy for dense linear algebra factorization. PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing

Cited by (3)

A Novel Parallel QR Algorithm for Hybrid Distributed Memory HPC Systems. SIAM Journal on Scientific Computing
Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures. Parallel Computing
Efficient reduction from block Hessenberg form to Hessenberg form using shared memory. PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2

Abstract

The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for the two-sided linear algebra transformations, in particular the Hessenberg and bidiagonal reductions, which are the first steps for the standard eigenvalue problem and the singular value decomposition, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors because it cannot fully exploit thread-level parallelism. At the same time, the fine-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. Buttari et al. (Parallel Comput. Syst. Appl. 35 (2009), 38-53) introduced the concept of tile algorithms, in which parallelism is no longer hidden inside the Basic Linear Algebra Subprograms but is brought to the fore to yield much better performance. Combined with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance, reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for two-sided transformations is that the full reduction cannot be obtained in one stage: the tile reductions produce band matrices, and other methods have to be considered to further reduce them to the required forms.
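
As an illustration of the data-driven execution mentioned in the abstract, the following is a minimal Python sketch of how a dataflow scheduler might derive task dependencies from the tiles each task reads and writes, and then release tasks as their inputs become available. It is not one of the three schedulers described in the paper: the task names, the tile labels (`A00`, `A01`, ...) and the functions `build_dag` and `schedule_waves` are all hypothetical, chosen only to show the general idea.

```python
from collections import defaultdict


def build_dag(tasks):
    """Derive dependencies from tile accesses.

    tasks: list of (name, reads, writes) given in sequential program order,
    where reads and writes are sets of tile labels (hypothetical names).
    Returns (succ, indeg): succ[i] holds the tasks that must wait for task i,
    and indeg[j] counts the unfinished predecessors of task j.
    """
    succ = defaultdict(set)
    indeg = [0] * len(tasks)
    last_writer = {}              # tile -> index of the last task that wrote it
    readers = defaultdict(list)   # tile -> tasks that read it since that write

    def add_edge(a, b):
        if a != b and b not in succ[a]:
            succ[a].add(b)
            indeg[b] += 1

    for i, (_, reads, writes) in enumerate(tasks):
        for tile in reads:                    # read-after-write
            if tile in last_writer:
                add_edge(last_writer[tile], i)
        for tile in writes:
            if tile in last_writer:           # write-after-write
                add_edge(last_writer[tile], i)
            for r in readers[tile]:           # write-after-read
                add_edge(r, i)
        for tile in reads:
            readers[tile].append(i)
        for tile in writes:
            last_writer[tile] = i
            readers[tile] = []
    return succ, indeg


def schedule_waves(tasks):
    """Group tasks into waves; tasks in the same wave are mutually independent."""
    succ, indeg = build_dag(tasks)
    ready = [i for i, d in enumerate(indeg) if d == 0]
    waves = []
    while ready:
        waves.append([tasks[i][0] for i in ready])
        released = []
        for i in ready:
            for j in succ[i]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    released.append(j)
        ready = sorted(released)
    return waves


# Hypothetical task list on a 2x2 grid of tiles (not the paper's kernels).
tasks = [
    ("panel A00", set(), {"A00"}),
    ("update A01", {"A00"}, {"A01"}),
    ("update A10", {"A00"}, {"A10"}),
    ("update A11", {"A01", "A10"}, {"A11"}),
    ("panel A11", set(), {"A11"}),
]
waves = schedule_waves(tasks)
for step, wave in enumerate(waves):
    print(step, wave)
```

In this toy example the two updates of `A01` and `A10` land in the same wave, so they could run on different cores, while the second panel task waits for the update of `A11` because both write the same tile. A real runtime would dispatch ready tasks to worker threads as soon as they are released rather than in synchronized waves; the wave grouping here only makes the available parallelism visible.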