As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, to take advantage of the architectural features of these new processors. Fine-grained parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU, and QR factorizations in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in out-of-order execution of tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented against LAPACK algorithms, where parallelism can be exploited only at the level of the BLAS operations, and against vendor implementations.
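The idea of small tasks on square blocks, scheduled by their dependencies, can be illustrated with a minimal Python sketch for the Cholesky case. It is not the paper's implementation. It assumes the usual right-looking tiled decomposition of the lower triangle, borrows the LAPACK/BLAS names POTRF, TRSM, SYRK, and GEMM purely as task labels, and models every task as taking one time step. The helper names `cholesky_tasks`, `build_dependencies`, and `schedule` are hypothetical. No numerical work is done; only the task graph and one possible execution order are produced.

```python
def cholesky_tasks(nt):
    """List the tasks of a tiled Cholesky on an nt x nt tile grid, in sequential order.

    Each task is (name, tiles_read, tile_updated); the updated tile is read and written.
    """
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [], (k, k)))            # factor diagonal tile
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], (i, k)))  # panel solve
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k)], (i, i)))  # diagonal update
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j)))
    return tasks


def build_dependencies(tasks):
    """Derive dependencies from data hazards on tiles (read/write ordering)."""
    last_writer = {}   # tile -> index of the last task that wrote it
    readers = {}       # tile -> tasks that read it since that write
    deps = []
    for t, (_, reads, out) in enumerate(tasks):
        d = set()
        for tile in reads + [out]:
            if tile in last_writer:
                d.add(last_writer[tile])       # wait for the producer of the data
        d |= readers.get(out, set())           # do not overwrite data still being read
        deps.append(d)
        for tile in reads:
            readers.setdefault(tile, set()).add(t)
        last_writer[out] = t
        readers[out] = set()
    return deps


def schedule(tasks, deps, workers):
    """Greedy schedule: at each step run up to `workers` tasks whose inputs are ready."""
    done = set()
    steps = []
    while len(done) < len(tasks):
        ready = [t for t in range(len(tasks))
                 if t not in done and deps[t] <= done]
        batch = ready[:workers]
        steps.append(batch)
        done.update(batch)
    return steps


tasks = cholesky_tasks(4)
deps = build_dependencies(tasks)
for step, batch in enumerate(schedule(tasks, deps, workers=2)):
    print(step, [tasks[t][0] for t in batch])
```

In the printed schedule, tasks belonging to the next panel can start while trailing updates from the previous panel are still pending, because the only constraint is that a task's input tiles are ready. A real runtime would replace the unit-time steps with a work queue and measured kernel times.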