A class of parallel tiled linear algebra algorithms for multicore architectures

Authors:
Alfredo Buttari;Julien Langou;Jakub Kurzak;Jack Dongarra
Affiliations:
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, United States;Department of Mathematical and Statistical Sciences, University of Colorado Denver, CO, United States;Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, United States;Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, United States and Oak Ridge National Laboratory, Oak Ridge, TN, United States and University of Manc ...
Venue:
Parallel Computing
Year:
2009

Abstract

As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms have to be developed, in order to take advantage of the architectural features of these new processors. Fine-grained parallelism becomes a major requirement and calls for loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU and QR factorizations in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in out-of-order execution of tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented against LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and against vendor implementations.
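The idea of small tasks on square blocks, scheduled by their dependencies, can be illustrated with a minimal sketch. The abstract does not spell out the task decomposition, so the code below assumes the commonly used right-looking tile Cholesky ordering, with the standard LAPACK/BLAS kernel names (POTRF, TRSM, SYRK, GEMM) used purely as task labels. The function names `cholesky_tasks`, `build_dag` and `schedule_waves` are hypothetical and are not taken from the paper. No numerical work is done; the sketch only enumerates tasks, infers dependencies from the tiles each task reads and writes, and groups tasks into waves that could run concurrently given enough cores.

```python
from collections import defaultdict


def cholesky_tasks(nt):
    """Enumerate tile-Cholesky tasks for an nt x nt tile grid.

    Each task is (name, tiles_read, tile_updated); the updated tile is
    both read and written. Tasks are listed in sequential program order.
    """
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [], (k, k)))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], (i, k)))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k)], (i, i)))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j)))
    return tasks


def build_dag(tasks):
    """Infer dependencies from tile accesses (assumed model: one writer per tile)."""
    last_writer = {}              # tile -> name of the last task that updated it
    readers = defaultdict(list)   # tile -> tasks that read it since that update
    deps = {}
    for name, reads, write in tasks:
        d = set()
        for tile in list(reads) + [write]:
            if tile in last_writer:        # read-after-write / write-after-write
                d.add(last_writer[tile])
        d.update(readers[write])           # write-after-read
        deps[name] = d
        for tile in reads:
            readers[tile].append(name)
        last_writer[write] = name
        readers[write] = []
    return deps


def schedule_waves(deps):
    """Group tasks into waves; every task in a wave has all its dependencies done."""
    done, waves = set(), []
    remaining = dict(deps)
    while remaining:
        ready = sorted(n for n, d in remaining.items() if d <= done)
        waves.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return waves


if __name__ == "__main__":
    tasks = cholesky_tasks(3)
    for step, wave in enumerate(schedule_waves(build_dag(tasks)), start=1):
        print(step, wave)
```

A real runtime would not wait for a whole wave to finish: it would start any task as soon as its own dependencies are satisfied and a core is free, which is what allows tasks from different steps of the factorization to overlap.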