QR factorization for the Cell Broadband Engine
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Programming matrix algorithms-by-blocks for thread-level parallelism
ACM Transactions on Mathematical Software (TOMS)
Scaling LAPACK panel operations using parallel cache assignment
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Scheduling two-sided transformations using tile algorithms on multicore architectures
Scientific Programming
Managing the complexity of lookahead for LU factorization with pivoting
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
A unified co-processor architecture for matrix decomposition
Journal of Computer Science and Technology
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Journal of Computational and Applied Mathematics
High-performance up-and-downdating via householder-like transformations
ACM Transactions on Mathematical Software (TOMS)
Tiled QR factorization algorithms
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
MR3-SMP: A symmetric tridiagonal eigensolver for multi-core architectures
Parallel Computing
DAGuE: A generic distributed DAG engine for High Performance Computing
Parallel Computing
An implementation of the tile QR factorization for a GPU and multiple CPUs
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
Concurrency and Computation: Practice & Experience
Reducing the amount of pivoting in symmetric indefinite systems
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Fast PCA computation in a DBMS with aggregate UDFs and LAPACK
Proceedings of the 21st ACM international conference on Information and knowledge management
Concurrency and Computation: Practice & Experience
High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures
ACM Transactions on Mathematical Software (TOMS)
Hierarchical QR factorization algorithms for multi-core clusters
Parallel Computing
Scaling LAPACK panel operations using parallel cache assignment
ACM Transactions on Mathematical Software (TOMS)
An improved parallel singular value algorithm and its implementation for multicore hardware
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data (referred to as ‘tiles’). These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks that will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can be exploited only at the level of the BLAS operations and with vendor implementations. Copyright © 2008 John Wiley & Sons, Ltd.